Quantifier-Elimination for the First-Order
Theory of Boolean Algebras with Linear
Cardinality Constraints*



Peter Revesz 

Department of Computer Science and Engineering 
University of Nebraska-Lincoln, Lincoln, NE 68588, USA 
revesz@cse.unl.edu



Abstract. We present for the first-order theory of atomic Boolean al- 
gebras of sets with linear cardinality constraints a quantifier elimination 
algorithm. In the case of atomic Boolean algebras of sets, this is a new 
generalization of Boole’s well-known variable elimination method for con- 
junctions of Boolean equality constraints. We also explain the connec- 
tion of this new logical result with the evaluation of relational calculus 
queries on constraint databases that contain Boolean linear cardinality 
constraints. 



1 Introduction 

Constraint relations, usually in the form of semi-algebraic or semi-linear sets, 
provide a natural description of data in many problems. What programming 
language could be designed to incorporate such constraint relations? Jaffar and 
Lassez [27] proposed in a landmark paper constraint logic programming, with the 
idea of extending Prolog, and its usual top-down evaluation style, with constraint 
solving (which replaces unification in Prolog). That is, in each rule application 
after the substitution of subgoals by constraint tuples, the evaluation needs 
to test the satisfiability of the constraints, and proceed forward or backtrack 
according to the result. 

As an alternative way of incorporating constraint relations, in a database 
framework, Kanellakis, Kuper, and Revesz [29,30] proposed constraint query 
languages as an extension of relational calculus with the basic insight that the 
evaluation of relational calculus queries on X-type constraint relations reduces 
to a quantifier elimination [15,16] in the first-order theory of X.¹ As an example
from [29,30], if the constraint relations are semi-algebraic relations, then
the quantifier elimination for real closed fields [2,3,12,13,40,54] can be used to
evaluate queries.

* This work was supported in part by USA National Science Foundation grant EIA-0091530.

¹ They also considered various bottom-up methods of evaluating Datalog queries and
the computational complexity and expressive power of constraint queries. Since relational
calculus is closely tied with practical query languages like SQL, it has captured
the most attention in the database area.

There are advantages and disadvantages of both styles of program/query eval- 
uations. Constraint logic programs have the advantage that their implementation 
can be based on only constraint satisfiability testing, which is usually easier and 
faster than quantifier elimination required by constraint relational calculus. On 
the other hand, the termination of constraint logic programs is not guaranteed, 
except in cases with a limited expressive power. For example, for negation-free 
Datalog queries with integer (gap)-order constraints the termination of both the 
tuple-recognition problem [14] and the least fixed point query evaluation [41, 
42] can be guaranteed. When either negation or addition constraints are also 
allowed, then termination cannot be guaranteed. In contrast, the evaluation of
constraint relational calculus queries has guaranteed termination, provided
there is a suitable effective quantifier elimination method.

While many other comparisons can be made (see the surveys [28,43] and 
the books [31,33,47]), these seem to be the most important. Their importance 
becomes clear when we consider the expected users. Professional programmers 
can write any software in any programming language and everything could be 
neatly hidden (usually under some kind of options menu) from the users. In 
contrast, database systems provide their users not with ready-made programs but
with an easy-to-use high-level programming language, in which they can write their
own simple programs. It is unthinkable that programs written in this language
should fail to terminate or, indeed, fail to run efficiently. Therefore constraint
database research focused on the efficient evaluation of simple non-recursive query languages.²

The constraint database field initially made rapid progress by taking some
quantifier elimination methods off the shelf. Semi-linear sets as constraint
relations are allowed in several prototype constraint database systems [8,21,24,25,
48,49] that use Fourier-Motzkin quantifier elimination for linear inequality con- 
straints [18]. The latest version of the DISCO system [9,50] implements Boolean 
equality constraints using Boole’s existential quantifier elimination method for 
conjunctions of Boolean equality constraints. 

Relational algebra queries were considered in [20,42,46]. As in relational 
databases, the algebraic operators are essential for the efficient evaluation of 
queries. In fact, in the above systems logical expressions in the form of relational 
calculus, SQL, and Datalog rules are translated into relational algebra. 

There were also deep and interesting questions about the relative expressive 
power and computational complexity of relational versus constraint query lan- 
guages. Some results in this area include [4,23,37,45] and a nice survey of these 
can be found in Chapters 3 and 4 of [31]. 



² Of course, many database-based products also provide menus to the users. However,
the users of database-based products are only indirect users of database systems. 
The direct users of database systems are application developers, who routinely embed 
SQL expressions into their programs. Thanks to today’s database systems, they need 
to worry less about termination, efficiency, and many other issues than yesterday’s 
programmers needed while developing software products. 

After these initial successes, it became clear that further progress may be
possible only by extending the quantifier elimination methods. Hence researchers
who happily got their hands dirty doing implementations found themselves back
at the mathematical drawing board.

The limitations of quantifier elimination seemed to be most poignant for 
Boolean algebras. It turns out that for conjunctions of Boolean equality and 
inequality constraints (which seems to require just a slight extension of Boole’s 
method) no quantifier elimination is possible. Let us see an example, phrased as 
a lemma. 

Lemma 1. There is no quantifier-free formula of Boolean equality and inequality
constraints that is equivalent in every Boolean algebra to the following formula:

$\exists d\, (d \sqcap g \neq \bot) \wedge (\overline{d} \sqcap g \neq \bot)$

where d and g are variables and $\bot$ is the zero element of the Boolean algebra. □

Consider the Boolean algebra of sets in which the one element is the set of names of
all persons, the zero element is the empty set, and the domain is the powerset
(the set of all subsets) of the one element.

In the formula variable d may be the set of students who took a database 
systems class, variable g may be the set of students who graduate this semester. 
Then the formula expresses the statement that “some graduating student took 
a database systems class, and some graduating student did not take a database 
systems class.” This formula implies that g has at least two elements, that is,
the cardinality of g is at least two, denoted as:

$|g| \geq 2$

But this fact cannot be expressed by any quantifier-free formula with Boolean
equality and inequality constraints and g as the only variable.

Lemma 1 implies that there is no general quantifier elimination method for 
formulas of Boolean equality and inequality constraints. This negative result was 
noted by several researchers, who then advocated approximations. For example,
Helm et al. [26] approximate the result by a formula of Boolean equality and
inequality constraints. Can we do better than just an approximation? 

The only hopeful development in the quantifier elimination area was by Mar- 
riott and Odersky [32] who showed that formulas with equality and inequality 
constraints admit quantifier elimination for the special case of atomless Boolean 
algebras. However, many Boolean algebras are not atomless but atomic. How 
can we deal with those Boolean algebras? Could any subset of atomic Boolean 
algebras also admit quantifier elimination? In this paper we show that the atomic 
Boolean algebras of sets, i.e., Boolean algebras where the Boolean algebra operators
are interpreted as the usual set operators of union, intersection and
complement with respect to the one element, also admit quantifier elimination, 
in spite of the pessimistic looking result of Lemma 1. 

Let us take a closer look at the Lemma. Surprisingly, the condition $|g| \geq 2$
is not only necessary, but it is also sufficient. That is, for any Boolean algebra
of sets if G is any set with at least two elements, then we can find a set D such
that the above formula holds. Therefore, $|g| \geq 2$ is exactly the quantifier-free
formula that we would like to have as a result of the quantifier elimination.
However, quantifier elimination techniques are normally required to give back
equivalent quantifier-free formulas with the same type of constraints as the input.
This condition is commonly called being closed under the set of constraints. This 
raises the interesting question of what happens if we allow cardinality constraints 
in our formulas. 
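
The necessity and sufficiency of $|g| \geq 2$ can be checked by brute force on a small Boolean algebra of sets. The following is only an illustrative sketch: the four-element universe and the helper names `subsets` and `exists_splitting_d` are assumptions made for this example and do not come from the paper.

```python
from itertools import combinations

def subsets(universe):
    """Yield every subset of `universe` as a frozenset."""
    items = sorted(universe)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def exists_splitting_d(g, universe):
    """True if some d makes both d ∩ g and complement(d) ∩ g non-empty."""
    return any(bool(d & g) and bool((universe - d) & g)
               for d in subsets(universe))

UNIVERSE = frozenset(range(4))  # hypothetical universe of four persons

# Compare the quantified formula with the cardinality condition |g| >= 2.
equivalent = all(
    exists_splitting_d(g, UNIVERSE) == (len(g) >= 2)
    for g in subsets(UNIVERSE)
)
print(equivalent)
```

The check covers only one finite algebra, so it illustrates the claim rather than proving it.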

While cardinality constraints on sets are considered by many authors, and 
interesting algorithms are developed for testing the satisfiability of a conjunction 
of cardinality constraints, there were, to our knowledge, no algorithms given 
for quantifier elimination for atomic Boolean algebras of sets with cardinality 
constraints. 

Calvanese and Lenzerini [11,10] study cardinality constraints that occur in 
ER-diagrams and ISA hierarchies. They give a method to test the satisfiability 
of a schema. This is a special case of cardinality constraints, because the ER- 
diagrams do not contain inequality constraints. 

Ohlbach and Koehler [35,36] consider a simple description logic with cardi- 
nality constraints. They give methods to test subsumption and satisfiability of 
their formulas, but they do not consider quantifier elimination. 

Seipel and Geske [51] use constraint logic programming to solve conjunctions 
of cardinality constraints. Their set of constraint logic programming [27] rules is 
sound but incomplete. 

Surprisingly, in this paper, we show that the augmented formulas, called 
Boolean linear cardinality constraint formulas, admit quantifier elimination. It 
is surprising that by adding to the set of atomic constraints, the problem of 
quantifier elimination becomes easier, not harder. Indeed, the end result, which 
is our quantifier elimination method described in this paper, may strike the 
reader as simple. But the finding of the trick of adding cardinality constraints 
for the sake of performing quantifier elimination is not obvious as shown by the 
following history summarized in Figure 1. In the figure the arrows point from less 
to more expressive Boolean constraint theories, but the labels on them indicate 
that the Boolean algebra needs to be of a certain type. Let’s describe Figure 1 
in some detail (please see Section 2 for definitions of unfamiliar terms).

Fig. 1. Quantifier elimination methods for Boolean theories.

Precedence between variables: A naive elimination of variables from a
set of Boolean precedence constraints between variables or constants (in the case
of algebras of sets, set containment constraints between sets) occurs in syllogistic
reasoning. Namely, the syllogistic rule if all x are y, and all y are z, then all x
are z yields a simple elimination of the y variable. Such syllogisms were described
already by Aristotle and developed further in medieval times and can be used
as the basis of eliminating variables. Srivastava, Ramakrishnan, and Revesz [52]
gave an existential quantifier elimination method for a special subset of the
Boolean algebra of sets. They considered existentially quantified formulas with
the quantifier-free part being only a conjunction of atomic constraints of the
form $v \subseteq w$, where v and w are constants or variables ranging over the elements
of a Boolean algebra of sets. Gervet [19] independently derived a similar method
about the same time. This shows that the problem of quantifier elimination for
sets arises naturally in different contexts.

Boolean equality: The quantifier elimination procedure for Boolean for- 
mulas with equality constraints was given by George Boole in 1856. His method 
allows the elimination of only existentially quantified variables. 

Boolean order: Revesz [44,47] shows that Boolean order constraints allow
existential quantifier elimination. Boolean order constraints (see Section 2.2) are
equality constraints of the form $x \sqcap \overline{t_{mon}} = \bot$ and inequality constraints of the
form $t_{mon} \neq \bot$ where x is a variable and $t_{mon}$ is a monotone Boolean term, that is, a
term that contains only the $\sqcap$ and $\sqcup$ operators. In Boolean algebras of sets, the
first of these constraints can be written as $t_{mon} \supseteq x$. This is clearly a generalization
of precedence constraints between variables. However, it is incomparable
with Boolean equalities. This theory contains some inequality constraints (which
cannot be expressed by equality constraints), namely when the left hand side
is a monotone term, but it cannot express all kinds of equality constraints, but
only those that have the form $\overline{t_{mon}} \sqcap x = \bot$.

Boolean equality and inequality: Marriott and Odersky [32] show that 
atomless Boolean algebras admit both existential and universal quantifier elimination.
Clearly, their method is a generalization of Boole’s method in the case
of atomless Boolean algebras.

Boolean linear cardinality: The quantifier elimination for this is introduced
in this paper. In the case of atomic Boolean algebras of sets, the new
linear cardinality constraints quantifier elimination is another generalization of 
the quantifier elimination considered by Boole as shown by Lemma 3. 

The outline of this paper is as follows. Section 2 gives some basic defini- 
tions regarding Boolean algebras, constraints, and theories. Section 3 describes 
a new quantifier elimination method for conjunctions of Boolean cardinality con- 
straints. Section 4 uses the quantifier elimination method for the evaluation of 
relational calculus queries. Finally, Section 5 gives some conclusions and direc- 
tions for further research. 

2 Basic Definitions 

We define Boolean algebras in Section 2.1, Boolean constraints in Section 2.2, 
and Boolean theories in Section 2.3. 

2.1 Boolean Algebras 

Definition 1. A Boolean algebra B is a tuple $(\delta, \sqcap, \sqcup, ', \bot, \top)$, where $\delta$ is a
non-empty set called the domain; $\sqcap$, $\sqcup$ are binary functions from $\delta \times \delta$ to $\delta$; $'$ is
a unary function from $\delta$ to $\delta$; and $\bot$, $\top$ are two specific elements of $\delta$ (called the
zero element and the one element, respectively) such that for any elements x, y,
and z in $\delta$ the following axioms hold:

\begin{align*}
x \sqcup y &= y \sqcup x & x \sqcap y &= y \sqcap x \\
x \sqcup (y \sqcap z) &= (x \sqcup y) \sqcap (x \sqcup z) & x \sqcap (y \sqcup z) &= (x \sqcap y) \sqcup (x \sqcap z) \\
x \sqcup x' &= \top & x \sqcap x' &= \bot \\
x \sqcup \bot &= x & x \sqcap \top &= x
\end{align*}

For Boolean algebras we define the precedence relation, denoted as $\sqsupseteq$, by the
following identity:

$x \sqsupseteq y$ means $x' \sqcap y = \bot$.

We also write the above as $y \sqsubseteq x$ and say that y precedes x.

The above gives a formal definition of a Boolean algebra. It can be considered its
syntax. The semantics of a Boolean algebra is given through interpretations for
the elements of its structure. Without going into deep details about the numerous
possible interpretations, we give one common interpretation of the domain and
operators.

Definition 2. A Boolean algebra of sets is any Boolean algebra, where:

$\delta$ is a set of sets,
$\sqcup$ is interpreted as set union, denoted as $\cup$,
$\sqcap$ is interpreted as set intersection, denoted as $\cap$,
$'$ is interpreted as set complement with respect to $\top$, denoted by an overline, and
$\sqsubseteq$ (or $\sqsupseteq$) is interpreted as set containment, denoted as $\subseteq$ (or $\supseteq$).
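
As a sanity check of Definitions 1 and 2, the axioms can be verified exhaustively on a small powerset algebra. This is a minimal sketch under stated assumptions: the three-atom universe and the names `TOP`, `BOTTOM`, `DOMAIN`, `comp`, `axioms_hold`, and `precedence_is_containment` are chosen for illustration only.

```python
from itertools import combinations, product

TOP = frozenset({0, 1, 2})   # one element (hypothetical three-atom universe)
BOTTOM = frozenset()         # zero element
DOMAIN = [frozenset(c) for r in range(len(TOP) + 1)
          for c in combinations(sorted(TOP), r)]

def comp(x):
    """Complement with respect to the one element."""
    return TOP - x

def axioms_hold():
    """Check the eight axioms of Definition 1 with union and intersection."""
    for x, y, z in product(DOMAIN, repeat=3):
        ok = (
            (x | y) == (y | x) and (x & y) == (y & x)
            and (x | (y & z)) == ((x | y) & (x | z))
            and (x & (y | z)) == ((x & y) | (x & z))
            and (x | comp(x)) == TOP and (x & comp(x)) == BOTTOM
            and (x | BOTTOM) == x and (x & TOP) == x
        )
        if not ok:
            return False
    return True

def precedence_is_containment():
    """The identity x' meet y = zero should coincide with y being a subset of x."""
    return all(((comp(x) & y) == BOTTOM) == (y <= x)
               for x, y in product(DOMAIN, repeat=2))

print(axioms_hold(), precedence_is_containment())
```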

Note: We call two Boolean algebras isomorphic if there is between them a
bijection which preserves their respective Boolean operations. By Stone’s representation
theorem, any Boolean algebra is isomorphic to a Boolean algebra of
sets [53]. Hence restricting our attention in this paper to Boolean algebras of
sets is without any significant loss of generality.

An atom of a Boolean algebra is an element $x \neq \bot$ such that there is no
other element $y \neq \bot$ with $y \sqsubseteq x$. It can happen that there are no atoms at all in
a Boolean algebra. In that case, we call the Boolean algebra atomless; otherwise
we call it atomic.

Let $\mathcal{N}$ denote the set of non-negative integers, $\mathcal{Z}$ denote the set of integers,
and $\mathcal{Q}$ denote the set of rational numbers.

Example 1. 

$B_Z = (\mathrm{Powerset}(\mathcal{Z}), \cap, \cup, \overline{\phantom{x}}, \emptyset, \mathcal{Z})$

is a Boolean algebra of sets. In this algebra for each $i \in \mathcal{Z}$ the singleton set $\{i\}$
is an atom. This algebra is atomic. □

Example 2. Let H be the set of all finite unions of half-open intervals of the
form [a, b) over the rational numbers, where [a, b) means all rational numbers
that are greater than or equal to a and less than b, where a is a rational number
or $-\infty$ and b is a rational number. Then:

$B_H = (H, \cap, \cup, \overline{\phantom{x}}, \emptyset, \mathcal{Q})$

is another Boolean algebra of sets. This algebra is atomless. □

2.2 Boolean Constraints 

In the following we consider only atomic Boolean algebras of sets with either a 
finite or a countably infinite number of atoms. For example, the Boolean algebra 
$B_Z$ has a countably infinite number of atoms. We also assume that we can take
the union or the intersection of an infinite number of elements of the Boolean 
algebra, i.e., our Boolean algebras are complete. 

Cardinality can be defined as follows. 

Definition 3. Let x be any element of an atomic Boolean algebra of sets with
a finite or countably infinite number of atoms. If x can be written as the finite
union of $n \in \mathcal{N}$ distinct atoms, then the cardinality of x, denoted $|x|$,
is n. Otherwise the cardinality of x is infinite, which is denoted as $+\infty$.

The following lemma shows that the cardinality is a well-defined function
from elements of an atomic Boolean algebra to $\mathcal{N} \cup \{+\infty\}$.

Lemma 2. Let x be any element of an atomic Boolean algebra of sets with
a finite or countably infinite number of atoms. Then x can be written as the
union of some $n \in \mathcal{N}$ or $+\infty$ number of distinct atoms in the Boolean algebra.
Moreover, there is no other set of atoms whose union is equivalent to x. □

For example, in the Boolean algebra $B_Z$ we can write $\{1, 2\}$ as $\{1\} \cup \{2\}$ or
as $\{2\} \cup \{1\}$. In either way, we use the same two distinct atoms, i.e., $\{1\}$ and
$\{2\}$. It follows from Lemma 2 that it is enough to allow only atomic constant
symbols in Boolean terms and Boolean constraints, which we define as follows.

Definition 4. Let $B = (\delta, \cap, \cup, \overline{\phantom{x}}, \bot, \top)$ be an atomic Boolean algebra of sets. A
Boolean term over B is an expression built from variables ranging over $\delta$, atomic
constants denoting particular atoms of B, $\bot$ denoting the zero element, $\top$ denoting
the one element, and the operators for intersection, union, and complement.

A Boolean term is monotone if it is built without the use of the complement
operator.

Definition 5. Boolean constraints have the form:

Equality: $t = \bot$
Inequality: $t \neq \bot$
Order: $t_{mon} \neq \bot$ or $t_{mon} \supseteq x$
Linear Cardinality: $c_1|t_1| + \ldots + c_k|t_k| \;\theta\; b$

where t and each $t_i$ for $1 \leq i \leq k$ is a Boolean term, $t_{mon}$ is a monotone Boolean
term, x is a Boolean variable, each $c_i$ for $1 \leq i \leq k$ and b is an integer constant
and $\theta$ is:

$=$ for the equality relation,
$\geq$ for the greater than or equal comparison operator,
$\leq$ for the less than or equal comparison operator, or
$\equiv_n$ for the congruence relation modulo some positive integer constant n.

2.3 Boolean Theories 

A formula of Boolean algebras of sets is a formula that is built in the usual way
from the existential quantifier $\exists$, the universal quantifier $\forall$, the logical connectives
$\wedge$ for and $\vee$ for or, the apostrophe $'$ for negation, and one of the above
types of Boolean algebra constraints.

A solution or model of a Boolean algebra constraint (or formula) is an as- 
signment of the (free) variables by elements of the Boolean algebra such that 
the constraint (or formula) is satisfied. A constraint (or formula) is true if every 
possible assignment is a solution. 

Example 3. Consider the Boolean algebra $B_Z$ of Example 1. Then the Boolean
linear cardinality constraint:

$3|x \cap y| - 2|z| = 4$

has many solutions. For example $x = \{3, 6, 9\}$ and $y = \{3, 4, 5, 6\}$ and $z = \{1\}$
is a solution. The Boolean linear cardinality constraint:

$|x \cup \{1\} \cup \{2\}| \geq 2$

is true, because every assignment to x is a solution. □
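
Both claims of Example 3 are easy to confirm numerically. The sketch below is illustrative only: the function name `lhs` and the finite window of integers used to sample assignments for x are assumptions, since the algebra $B_Z$ itself is infinite.

```python
from itertools import combinations

def lhs(x, y, z):
    """Left-hand side 3|x ∩ y| - 2|z| of the first constraint."""
    return 3 * len(x & y) - 2 * len(z)

# The assignment given in Example 3.
x, y, z = {3, 6, 9}, {3, 4, 5, 6}, {1}
first_holds = lhs(x, y, z) == 4

# Second constraint: |x ∪ {1} ∪ {2}| >= 2 for every x.
# Only subsets of this finite window are enumerated (an assumption of the sketch).
window = range(-2, 5)
second_holds = all(
    len(set(c) | {1} | {2}) >= 2
    for r in range(len(window) + 1)
    for c in combinations(window, r)
)
print(first_holds, second_holds)
```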

Example 4. Suppose that we know the following facts about a company:

1. The number of salespersons is a multiple of 6.

2. The number of engineers is a multiple of 9.

3. There are seven employees who are both salespersons and engineers but not
managers.

4. There are twice as many managers who are salespersons as managers who
are engineers.

5. Each manager is either a salesperson or an engineer but not both.

Using variables x for managers, y for salespersons, and z for engineers, we
can express the above using the following Boolean cardinality formula S:

\begin{align*}
|y| &\equiv_6 0 \;\wedge \\
|z| &\equiv_9 0 \;\wedge \\
|\overline{x} \cap y \cap z| &= 7 \;\wedge \\
|x \cap y| - 2|x \cap z| &= 0 \;\wedge \\
|x \cap y \cap z| + |x \cap \overline{y} \cap \overline{z}| &= 0
\end{align*}

□
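
To make the five facts concrete, the formula S can be evaluated on explicit finite sets. The following is a minimal sketch: the function `satisfies_S` and the fourteen numbered staff members are hypothetical and are not taken from the paper.

```python
def satisfies_S(x, y, z):
    """Evaluate the five conjuncts of formula S on concrete finite sets."""
    return (
        len(y) % 6 == 0                            # fact 1: |y| is a multiple of 6
        and len(z) % 9 == 0                        # fact 2: |z| is a multiple of 9
        and len((y & z) - x) == 7                  # fact 3: seven in y and z but not x
        and len(x & y) - 2 * len(x & z) == 0       # fact 4
        and len(x & y & z) + len(x - y - z) == 0   # fact 5
    )

# Hypothetical staff numbered 0..13.
y = set(range(12))              # salespersons
z = set(range(7)) | {12, 13}    # engineers
x = {7, 8, 12}                  # managers

print(satisfies_S(x, y, z))
```

Changing x to `{7}` violates fact 4, because then one manager is a salesperson and none is an engineer.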



Example 5. Suppose that we know the following additional fact about the company
in Example 4:

6. Person a is both a manager and an engineer but not a salesperson.

Here a is a constant symbol that denotes a particular atom of the Boolean
algebra. We can express this by the following Boolean constraint:

$|x \cap \overline{y} \cap z \cap a| = 1$

□

In any given Boolean algebra two formulas are equivalent if they have the 
same set of models. The purpose of quantifier elimination is to rewrite a given 
formula with quantifiers into an equivalent quantifier-free formula [16]. A quan- 
tifier elimination method is closed or has a closed-form if the type of constraints 
in the formula with quantifiers and the quantifier-free formula are the same. (It 
is sometimes possible to eliminate quantifiers at the expense of introducing more 
powerful constraints, but then the quantifier elimination will not be closed-form.) 

In arithmetic theories quantifier elimination is a well-studied problem. A well-known
theorem due to Presburger ([38] and improvements in [6,7,17,55,56]) is
that a closed-form quantifier elimination is possible for formulas with linear
equations (including congruence equations modulo some fixed set of integers).
We will use this powerful theorem in this paper.

The following lemma shows in the case of atomic Boolean algebras of sets a 
simple translation from formulas with only equality and inequality constraints 
into formulas with only linear cardinality constraints. 

Lemma 3. In any atomic Boolean algebra of sets, for every term t we have the
following:

$t = \bot$ if and only if $|t| = 0$
$t \neq \bot$ if and only if $|t| \geq 1$

Moreover, using the above identities, any formula with only equality and inequality
constraints can be written as a formula with only linear cardinality constraints. □

Finally, we introduce some useful technical definitions related to Boolean 
theories and formulas. 

Let F be a formula that contains the variable and atomic constant symbols
$z_1, \ldots, z_n$. Then the minterms of F, denoted Minterm(F), are the set of expressions
of the form $\zeta_1 \cap \ldots \cap \zeta_n$ where each $\zeta_i$ is either $z_i$ or $\overline{z_i}$, that is, each
minterm must contain each variable and atomic constant symbol either positively
or negatively. Note that there are exactly $2^n$ minterms of F. We can order
the minterms in a lexicographic order assuming that positive literals precede
negative literals.

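Example 6. Suppose that we use only the following variables and no constant
symbols in a formula: x, y, z. Then we can form from these variables the following
eight minterms in order:

\begin{align*}
& x \cap y \cap z, && x \cap y \cap \overline{z}, && x \cap \overline{y} \cap z, && x \cap \overline{y} \cap \overline{z}, \\
& \overline{x} \cap y \cap z, && \overline{x} \cap y \cap \overline{z}, && \overline{x} \cap \overline{y} \cap z, && \overline{x} \cap \overline{y} \cap \overline{z}
\end{align*}

If we also use the atomic constant symbol a, then we can form the following
minterms:

\begin{align*}
& x \cap y \cap z \cap a, && x \cap y \cap z \cap \overline{a}, && x \cap y \cap \overline{z} \cap a, && x \cap y \cap \overline{z} \cap \overline{a}, \\
& x \cap \overline{y} \cap z \cap a, && x \cap \overline{y} \cap z \cap \overline{a}, && x \cap \overline{y} \cap \overline{z} \cap a, && x \cap \overline{y} \cap \overline{z} \cap \overline{a}, \\
& \overline{x} \cap y \cap z \cap a, && \overline{x} \cap y \cap z \cap \overline{a}, && \overline{x} \cap y \cap \overline{z} \cap a, && \overline{x} \cap y \cap \overline{z} \cap \overline{a}, \\
& \overline{x} \cap \overline{y} \cap z \cap a, && \overline{x} \cap \overline{y} \cap z \cap \overline{a}, && \overline{x} \cap \overline{y} \cap \overline{z} \cap a, && \overline{x} \cap \overline{y} \cap \overline{z} \cap \overline{a}
\end{align*}

□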
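
The lexicographic enumeration of minterms can be generated mechanically. This is an illustrative sketch only: the function name `minterms` and the textual encoding (`&` for intersection, a leading `~` for complement) are assumptions made for the example.

```python
from itertools import product

def minterms(symbols):
    """List all minterms over `symbols`, positive literals before negative ones."""
    result = []
    # product((True, False), ...) varies the last symbol fastest, which gives
    # the lexicographic order with positive literals first.
    for signs in product((True, False), repeat=len(symbols)):
        literals = [s if positive else "~" + s
                    for s, positive in zip(symbols, signs)]
        result.append(" & ".join(literals))
    return result

for m in minterms(["x", "y", "z"]):
    print(m)
```

Calling `minterms(["x", "y", "z", "a"])` yields the sixteen minterms of the second list in the same order.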

Note that any minterm that contains two or more atoms positively is equivalent
to $\bot$, because two distinct atoms have an empty intersection. This allows some
simplifications in certain cases.

Each Boolean cardinality constraint with n variables and atomic constants
can be put into the normal form:

$c_1|m_1| + \ldots + c_{2^n}|m_{2^n}| \;\theta\; b$   (1)

where b and each $c_i$ is an integer constant and each $m_i$ is a minterm of the
Boolean algebra for $1 \leq i \leq 2^n$, and $\theta$ is as in Definition 5.

Example 7. We rewrite S of Example 4 into the following normal form (omitting
minterms with zero coefficients and the $\wedge$ symbol at the end of lines):

\begin{align*}
|x \cap y \cap z| + |x \cap y \cap \overline{z}| + |\overline{x} \cap y \cap z| + |\overline{x} \cap y \cap \overline{z}| &\equiv_6 0 \\
|x \cap y \cap z| + |x \cap \overline{y} \cap z| + |\overline{x} \cap y \cap z| + |\overline{x} \cap \overline{y} \cap z| &\equiv_9 0 \\
|\overline{x} \cap y \cap z| &= 7 \\
|x \cap y \cap z| - |x \cap y \cap \overline{z}| + 2|x \cap \overline{y} \cap z| &= 0 \\
|x \cap y \cap z| + |x \cap \overline{y} \cap \overline{z}| &= 0
\end{align*}

The constraint in Example 5 is already in normal form.
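
The rewriting into normal form amounts to expanding each term into the minterms it contains and summing coefficients. The sketch below assumes a hypothetical representation in which a Boolean term is a Python predicate over a truth assignment; the name `normal_form` is likewise an assumption.

```python
from itertools import product

def normal_form(variables, weighted_terms):
    """Return {sign tuple: coefficient} for the sum of c * |t| over minterms.

    Each sign tuple has one entry per variable (True = positive literal).
    A term is given as a predicate over a dict mapping variable names to bools.
    """
    coeffs = {}
    for signs in product((True, False), repeat=len(variables)):
        env = dict(zip(variables, signs))
        total = sum(c for c, term in weighted_terms if term(env))
        if total != 0:   # omit minterms with zero coefficient
            coeffs[signs] = total
    return coeffs

# Left-hand side |x ∩ y| - 2|x ∩ z| of the fourth conjunct of S.
nf = normal_form(
    ["x", "y", "z"],
    [(1, lambda e: e["x"] and e["y"]),
     (-2, lambda e: e["x"] and e["z"])],
)
print(nf)
```

The computed coefficients are -1, +1 and -2 for the minterms x∩y∩z, x∩y∩z̄ and x∩ȳ∩z; multiplying the equation by -1 gives the fourth line of the normal form in Example 7.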

3 Quantifier Elimination Method 

We give below a quantifier elimination algorithm for Boolean linear cardinality 
constraint formulas in atomic Boolean algebras of sets. 

Theorem 1. [constant-free case] Existentially quantified variables can be 
eliminated from Boolean linear cardinality constraint formulas. The quantifier 
elimination is closed, that is, yields a quantifier-free Boolean linear cardinality 
constraint formula. □ 

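Example 8. Let S be the Boolean cardinality formula in Example 4. A quantifier
elimination problem would be to find a quantifier-free formula that is logically
equivalent to the following:

$\exists x\, S$

First we put S into a normal form as shown in Example 7. Then let S* be the
conjunction of the normal form and the following:

\begin{align*}
|y \cap z| - |x \cap y \cap z| - |\overline{x} \cap y \cap z| &= 0 \\
|y \cap \overline{z}| - |x \cap y \cap \overline{z}| - |\overline{x} \cap y \cap \overline{z}| &= 0 \\
|\overline{y} \cap z| - |x \cap \overline{y} \cap z| - |\overline{x} \cap \overline{y} \cap z| &= 0 \\
|\overline{y} \cap \overline{z}| - |x \cap \overline{y} \cap \overline{z}| - |\overline{x} \cap \overline{y} \cap \overline{z}| &= 0
\end{align*}

Second, we consider each expression that is the cardinality of a minterm (over
x, y, z or over y, z) as an integer variable. Then by integer linear constraint
variable elimination we get:

\begin{align*}
|y \cap z| + |y \cap \overline{z}| &\equiv_6 0 \\
|y \cap z| + |\overline{y} \cap z| &\equiv_9 0 \\
|y \cap z| &= 7
\end{align*}

By Theorem 1, the above is equivalent to $\exists x\, S$. For instance, the following
is one solution for the above:

\begin{align*}
|y \cap z| &= 7 \\
|y \cap \overline{z}| &= 5 \\
|\overline{y} \cap z| &= 2 \\
|\overline{y} \cap \overline{z}| &= 0
\end{align*}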

Fig. 2. Venn diagram for the employee example.



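Corresponding to this we can find the solution:

\begin{align*}
|x \cap y \cap z| &= 0 & |\overline{x} \cap y \cap z| &= 7 \\
|x \cap y \cap \overline{z}| &= 2 & |\overline{x} \cap y \cap \overline{z}| &= 3 \\
|x \cap \overline{y} \cap z| &= 1 & |\overline{x} \cap \overline{y} \cap z| &= 1 \\
|x \cap \overline{y} \cap \overline{z}| &= 0 & |\overline{x} \cap \overline{y} \cap \overline{z}| &= 0
\end{align*}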



Finally, given any assignment of sets to y and z such that the first group of 
equalities holds, then we can find an assignment to x such that the second set of 
equalities also holds. While this cannot be illustrated for all possible assignments, 
consider just one assignment shown in Figure 2 where each dot represents a 
distinct person. Given the sets y and z (shown with solid lines), we can find a set
x (shown in Figure 2 with a dashed line) that satisfies the required constraints.

□ 
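
The equivalence claimed in Example 8 can be tested by brute force at the level of minterm cardinalities. This is a hedged sketch, not the elimination algorithm itself: the functions `exists_x` and `eliminated`, the argument names, and the small search ranges are assumptions chosen so that the exhaustive search stays cheap.

```python
from itertools import product

def exists_x(yz, y_nz, ny_z, ny_nz):
    """Search for a split of each y/z minterm into an x part that satisfies S.

    Arguments are the cardinalities |y∩z|, |y∩~z|, |~y∩z|, |~y∩~z|.
    """
    for a, b, c, d in product(range(yz + 1), range(y_nz + 1),
                              range(ny_z + 1), range(ny_nz + 1)):
        # a=|x∩y∩z|, b=|x∩y∩~z|, c=|x∩~y∩z|, d=|x∩~y∩~z|
        if ((yz + y_nz) % 6 == 0                # |y| is a multiple of 6
                and (yz + ny_z) % 9 == 0        # |z| is a multiple of 9
                and yz - a == 7                 # |~x∩y∩z| = 7
                and (a + b) - 2 * (a + c) == 0  # |x∩y| - 2|x∩z| = 0
                and a + d == 0):                # |x∩y∩z| + |x∩~y∩~z| = 0
            return True
    return False

def eliminated(yz, y_nz, ny_z, ny_nz):
    """The quantifier-free result stated in Example 8."""
    return (yz + y_nz) % 6 == 0 and (yz + ny_z) % 9 == 0 and yz == 7

# Compare both formulas on a small grid of cardinalities.
agree = all(exists_x(*n) == eliminated(*n)
            for n in product(range(9), range(7), range(4), range(2)))
print(agree)
```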



In the constant-free case, once we know the cardinality of each minterm, any 
assignment of the required number of arbitrary distinct atoms to the minterms 
yields an assignment to the variables that satisfies the Boolean formula. 
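
The construction described above, picking the required number of arbitrary atoms from each region, can be sketched as follows. The helper `build_x`, the region names, and the fourteen-person universe are hypothetical; the target counts are those of the solution in Example 8.

```python
def build_x(y, z, universe, wanted):
    """Form x by taking `wanted[k]` arbitrary atoms from each y/z minterm."""
    regions = {
        "yz": y & z,
        "y_nz": y - z,
        "ny_z": z - y,
        "ny_nz": universe - y - z,
    }
    x = set()
    for name, region in regions.items():
        # Any choice of atoms works; taking the smallest ones is arbitrary.
        x |= set(sorted(region)[:wanted[name]])
    return x

universe = set(range(14))
y = set(range(12))              # |y∩z| = 7, |y∩~z| = 5
z = set(range(7)) | {12, 13}    # |~y∩z| = 2
wanted = {"yz": 0, "y_nz": 2, "ny_z": 1, "ny_nz": 0}

x = build_x(y, z, universe, wanted)
print(sorted(x))
```

With these inputs the constructed x contains two atoms of y∩z̄ and one atom of ȳ∩z, matching the cardinalities listed in Example 8.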

Of course, when the minterms contain atoms, we cannot assign arbitrary 
atoms. We have to assign to a minterm that contains an atom a positively either 
the atom that a denotes or the bottom element _L. There is no other choice that 
can be allowed. 

We handle atomic constants as follows. We consider them as additional
Boolean algebra variables that have cardinality one and the cardinality of their
intersection is zero. That is, if we have atomic constants $a_1, \ldots, a_c$ in a formula,
then we add to system (3) the following for each $1 \leq i \leq c$:

$|a_i| = 1$

and also for each $1 \leq i, j \leq c$ and $i \neq j$:

$|a_i \cap a_j| = 0$

Of course, both of the above conditions have to be put into a normal form
before adding them to system (3). Then we solve these as in Theorem 1. Now 
we can prove the following. 

Theorem 2. [with atomic constants] Existentially quantified variables can
be eliminated from Boolean linear cardinality constraint formulas. The quantifier
elimination is closed, that is, yields a quantifier-free Boolean linear cardinality
constraint formula. □



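Example 9. Let S be the Boolean cardinality formula that is the conjunction
of the formula in Example 4 and the constraint in Example 5. A quantifier
elimination problem would be to find a quantifier-free formula that is logically
equivalent to the following:

$\exists x\, S$

First we put S into a normal form. Then let S* be the conjunction of the normal
form and the following:

\begin{align*}
|y \cap z \cap a| - |x \cap y \cap z \cap a| - |\overline{x} \cap y \cap z \cap a| &= 0 \\
|y \cap z \cap \overline{a}| - |x \cap y \cap z \cap \overline{a}| - |\overline{x} \cap y \cap z \cap \overline{a}| &= 0 \\
|y \cap \overline{z} \cap a| - |x \cap y \cap \overline{z} \cap a| - |\overline{x} \cap y \cap \overline{z} \cap a| &= 0 \\
|y \cap \overline{z} \cap \overline{a}| - |x \cap y \cap \overline{z} \cap \overline{a}| - |\overline{x} \cap y \cap \overline{z} \cap \overline{a}| &= 0 \\
|\overline{y} \cap z \cap a| - |x \cap \overline{y} \cap z \cap a| - |\overline{x} \cap \overline{y} \cap z \cap a| &= 0 \\
|\overline{y} \cap z \cap \overline{a}| - |x \cap \overline{y} \cap z \cap \overline{a}| - |\overline{x} \cap \overline{y} \cap z \cap \overline{a}| &= 0 \\
|\overline{y} \cap \overline{z} \cap a| - |x \cap \overline{y} \cap \overline{z} \cap a| - |\overline{x} \cap \overline{y} \cap \overline{z} \cap a| &= 0 \\
|\overline{y} \cap \overline{z} \cap \overline{a}| - |x \cap \overline{y} \cap \overline{z} \cap \overline{a}| - |\overline{x} \cap \overline{y} \cap \overline{z} \cap \overline{a}| &= 0
\end{align*}

We also add the constraint $|a| = 1$ expressed using minterms as:

\begin{align*}
& |x \cap y \cap z \cap a| + |x \cap y \cap \overline{z} \cap a| + |x \cap \overline{y} \cap z \cap a| + |x \cap \overline{y} \cap \overline{z} \cap a| \\
& + |\overline{x} \cap y \cap z \cap a| + |\overline{x} \cap y \cap \overline{z} \cap a| + |\overline{x} \cap \overline{y} \cap z \cap a| + |\overline{x} \cap \overline{y} \cap \overline{z} \cap a| = 1
\end{align*}

Second, we consider each expression that is the cardinality of a minterm (over
x, y, z, a or over y, z, a) as an integer variable. Then by integer linear constraint
variable elimination we get:

\begin{align*}
|y \cap z \cap a| + |y \cap z \cap \overline{a}| + |y \cap \overline{z} \cap a| + |y \cap \overline{z} \cap \overline{a}| &\equiv_6 0 \\
|y \cap z \cap a| + |y \cap z \cap \overline{a}| + |\overline{y} \cap z \cap a| + |\overline{y} \cap z \cap \overline{a}| &\equiv_9 0 \\
|y \cap z \cap a| + |y \cap z \cap \overline{a}| &= 7 \\
|\overline{y} \cap z \cap a| &= 1
\end{align*}

The last linear cardinality constraint comes from the constraint in Example 5
and the constraint that $|a| = 1$. By Theorem 2, the above is equivalent to $\exists x\, S$.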
For instance, the following is one solution for the above:

\begin{align*}
|y \cap z \cap a| &= 0 & |\overline{y} \cap z \cap a| &= 1 \\
|y \cap z \cap \overline{a}| &= 7 & |\overline{y} \cap z \cap \overline{a}| &= 1 \\
|y \cap \overline{z} \cap a| &= 0 & |\overline{y} \cap \overline{z} \cap a| &= 0 \\
|y \cap \overline{z} \cap \overline{a}| &= 5 & |\overline{y} \cap \overline{z} \cap \overline{a}| &= 0
\end{align*}

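Corresponding to this we can find the solution:

\begin{align*}
|x \cap y \cap z \cap a| &= 0 & |\overline{x} \cap y \cap z \cap a| &= 0 \\
|x \cap y \cap z \cap \overline{a}| &= 0 & |\overline{x} \cap y \cap z \cap \overline{a}| &= 7 \\
|x \cap y \cap \overline{z} \cap a| &= 0 & |\overline{x} \cap y \cap \overline{z} \cap a| &= 0 \\
|x \cap y \cap \overline{z} \cap \overline{a}| &= 2 & |\overline{x} \cap y \cap \overline{z} \cap \overline{a}| &= 3 \\
|x \cap \overline{y} \cap z \cap a| &= 1 & |\overline{x} \cap \overline{y} \cap z \cap a| &= 0 \\
|x \cap \overline{y} \cap z \cap \overline{a}| &= 0 & |\overline{x} \cap \overline{y} \cap z \cap \overline{a}| &= 1 \\
|x \cap \overline{y} \cap \overline{z} \cap a| &= 0 & |\overline{x} \cap \overline{y} \cap \overline{z} \cap a| &= 0 \\
|x \cap \overline{y} \cap \overline{z} \cap \overline{a}| &= 0 & |\overline{x} \cap \overline{y} \cap \overline{z} \cap \overline{a}| &= 0
\end{align*}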



Finally, given any assignment of sets to y, z, and a such that the first group 
of equalities holds and the correct atom is assigned to a, then we can find an 
assignment to x such that the second set of equalities also holds. Again, consider 
just one assignment shown in Figure 3 where each dot represents a distinct 
person. Given the sets y, z, and a (shown with solid lines), we can find a set x 
(shown in Figure 3 with a dashed line) that satisfies the required constraints. 

It is interesting to compare the above with Example 8. There we could choose
for $x$ an arbitrary atom of the set $y \cap z$, but here we must choose the atom $a$.
More precisely, we can choose an arbitrary atom of the set $y \cap z \cap a$, which, of
course, is the same. □




Fig. 3. Venn diagram for Example 9. 
Theorem 3. Universally quantified variables can be eliminated from Boolean
linear cardinality constraint formulas. The quantifier elimination is closed, that 
is, yields a quantifier-free Boolean linear cardinality constraint formula. □ 

Given any formula, it can be decided whether it is true by successively elimi- 
nating all variables from it. Then it becomes easy to test whether the variable-free 
formula is true or false. Hence, this also shows that: 

Corollary 1. It can be decided whether a Boolean linear cardinality constraint 
formula is true or false. □ 

4 Query Evaluation 

It is well-known that several practical query languages, such as SQL without 
aggregation operators, can be translated into relational calculus [1,39]. Hence 
while relational calculus is not used directly in major database systems, many 
theoretical results are stated in terms of it with clear implications for the more 
practical query languages. Hence we will do the same here. 

A relational calculus formula is built from relation symbols $R_i$ with variable
and constant symbol arguments, the connectives $\wedge$ for and, $\vee$ for or, $\rightarrow$ for
implication, and overline for negation, and the quantifiers $\exists$ and $\forall$ in the usual
way. Each relation has a fixed arity or number of arguments. If $R_i$ is a $k$-ary
relation, then it always occurs in the formula in the form $R_i(z_1, \ldots, z_k)$, where
each $z_j$ for $1 \le j \le k$ is a variable or constant symbol.

A general framework for using constraint databases is presented in [29,30]. 
The following three definitions are from that paper. 

1. A generalized $k$-tuple is a quantifier-free conjunction of constraints on $k$
ordered variables ranging over a domain $\delta$. Each generalized $k$-tuple represents
in a finite way a possibly infinite set of regular $k$-tuples.

2. A generalized relation of arity $k$ is a finite set of generalized $k$-tuples, with
each $k$-tuple over the same variables.

3. A generalized database is a finite set of generalized relations. 

Let $r_i$ be the generalized relation assigned to $R_i$. We associate with each
$r_i$ a formula $F_{r_i}$ that is the disjunction of the set of generalized $k$-tuples of $r_i$.
According to the above definition, $F_{r_i}$ is a quantifier-free disjunctive normal form
(DNF) formula. This is not absolutely necessary, and other researchers allow
non-DNF formulas too. The above generalization is from finite relations
found in relational database systems to finitely representable (via constraints)
but possibly infinite relations.

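Let $\phi$ be any relational calculus formula. Satisfaction with respect to a
domain $\delta$ and database $d$, denoted $\langle \delta, d \rangle \models \phi$, is defined recursively as follows:

\begin{align}
\langle \delta, d \rangle \models R_i(a_1, \ldots, a_k) \quad &\text{iff} \quad F_{r_i}(a_1, \ldots, a_k) \text{ is true} \tag{2} \\
\langle \delta, d \rangle \models \phi \wedge \psi \quad &\text{iff} \quad \langle \delta, d \rangle \models \phi \text{ and } \langle \delta, d \rangle \models \psi \tag{3} \\
\langle \delta, d \rangle \models \phi \vee \psi \quad &\text{iff} \quad \langle \delta, d \rangle \models \phi \text{ or } \langle \delta, d \rangle \models \psi \tag{4} \\
\langle \delta, d \rangle \models \phi \rightarrow \psi \quad &\text{iff} \quad \text{not } \langle \delta, d \rangle \models \phi \text{ or } \langle \delta, d \rangle \models \psi \tag{5} \\
\langle \delta, d \rangle \models \phi \leftrightarrow \psi \quad &\text{iff} \quad \langle \delta, d \rangle \models \phi \rightarrow \psi \text{ and } \langle \delta, d \rangle \models \psi \rightarrow \phi \tag{6} \\
\langle \delta, d \rangle \models \overline{\phi} \quad &\text{iff} \quad \text{not } \langle \delta, d \rangle \models \phi \tag{7} \\
\langle \delta, d \rangle \models \exists x_i \phi \quad &\text{iff} \quad \langle \delta, d \rangle \models \phi[x_i/a_j] \text{ for some } a_j \in \delta \tag{8} \\
\langle \delta, d \rangle \models \forall x_i \phi \quad &\text{iff} \quad \langle \delta, d \rangle \models \phi[x_i/a_j] \text{ for each } a_j \in \delta \tag{9}
\end{align}

where $[x_i/a_j]$ means the instantiation of the free variable $x_i$ by $a_j$.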
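
The recursive rules above can be read as an evaluator in the special case of a finite domain and finite relations. The following is a minimal sketch under that assumption only; the tuple encoding of formulas, the function name `holds`, and the `Edge` relation are hypothetical and not part of the paper. In this finite setting the atomic case tests membership in a table rather than satisfaction of $F_{r_i}$.

```python
def holds(phi, domain, db, env):
    """Decide <domain, db> |= phi under the variable assignment env (finite case only)."""
    op = phi[0]
    if op == "rel":                                   # atomic formula R_i(z_1, ..., z_k)
        _, name, args = phi
        row = tuple(env.get(z, z) for z in args)      # variables are looked up, constants kept
        return row in db[name]
    if op == "and":
        return holds(phi[1], domain, db, env) and holds(phi[2], domain, db, env)
    if op == "or":
        return holds(phi[1], domain, db, env) or holds(phi[2], domain, db, env)
    if op == "implies":
        return (not holds(phi[1], domain, db, env)) or holds(phi[2], domain, db, env)
    if op == "iff":
        return holds(phi[1], domain, db, env) == holds(phi[2], domain, db, env)
    if op == "not":
        return not holds(phi[1], domain, db, env)
    if op == "exists":                                # some instantiation of the variable works
        _, var, body = phi
        return any(holds(body, domain, db, {**env, var: a}) for a in domain)
    if op == "forall":                                # every instantiation of the variable works
        _, var, body = phi
        return all(holds(body, domain, db, {**env, var: a}) for a in domain)
    raise ValueError(f"unknown connective: {op}")


# Hypothetical finite example: which x have an outgoing Edge?
domain = {1, 2, 3}
db = {"Edge": {(1, 2), (2, 3)}}
has_successor = ("exists", "y", ("rel", "Edge", ("x", "y")))
output = {a for a in domain if holds(has_successor, domain, db, {"x": a})}
```

The quantifier cases enumerate the whole domain, which is exactly the step that is unavailable when the domain or the relations are infinite.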



The above semantics does not immediately suggest a query evaluation 
method.^ In relational databases, the above would suggest a query evaluation, 
because in the last two rules we clearly need to consider a finite number of cases, 
but we cannot rely on this finiteness in constraint databases. However, the fol- 
lowing alternative semantics, that is equivalent to the above, is discussed in [29, 
30]: 

Let $\phi(x_1, \ldots, x_m)$ be a relational calculus formula with free variables
$x_1, \ldots, x_m$. Let relation symbols $R_1, \ldots, R_n$ in $\phi$ be assigned the generalized
relations $r_1, \ldots, r_n$, respectively. Let $\phi_1 = \phi[R_1/F_{r_1}, \ldots, R_n/F_{r_n}]$ be the formula
that is obtained by replacing in $\phi$ each relation symbol $R_i(z_1, \ldots, z_k)$ by the
formula

\[ F_{r_i}[x_1/z_1, \ldots, x_k/z_k] \]

where $F_{r_i}(x_1, \ldots, x_k)$ is the formula associated with $r_i$. Note that $\phi_1$ is a simple
first order formula of constraints, that is, it does not have any of the relation
symbols $R_i$ in it, hence $d$ is no longer relevant in checking the satisfaction of $\phi_1$.
The output database of $\phi$ on input database $r_1, \ldots, r_n$ is the relation

\[ r_{out} = \{ (a_1, \ldots, a_m) : \langle \delta, d \rangle \models \phi_1(a_1, \ldots, a_m) \}. \]

The above is a possibly infinite relation that also needs to be finitely represented.

Such a finite representation can be found by quantifier elimination. The
goal of quantifier elimination is to find a quantifier-free formula $F_1$ that has the
same models as $\phi_1$ has [16]. That is,

\[ r_{out} = \{ (a_1, \ldots, a_m) : \langle \delta \rangle \models F_1(a_1, \ldots, a_m) \}. \]

Hence the alternative semantic definition yields an effective method for query
evaluation based on quantifier elimination. Moreover, $F_1$ can be put into a DNF
to preserve the representation of constraint relations.

The above general approach to query evaluation also applies when the input 
relations are described by linear cardinality constraints in atomic Boolean alge- 
bras of sets. In particular, the existential quantifier elimination in Theorem 2 
and the universal quantifier elimination in Theorem 3 can be used to evaluate 
relational calculus queries. 

® This semantic definition closely follows the usual definition of the semantics of rela- 
tional calculus queries on relational databases. Indeed, the only difference is that in 
rule (4) the usual statement is that $(a_1, \ldots, a_k)$ is a tuple in the relation or “a row
in the table” instead of the tuple satisfying a formula. 
Theorem 4. Relational calculus queries on constraint relations that are
described using quantifier-free Boolean linear cardinality constraint formulas can
be evaluated in closed-form. □

Example 10. Each strand of a DNA can be considered as a string of four letters: 
A, C, G, and T. In bioinformatics, we often have a set of probes, which are 
small already sequenced fragments from different already known parts of a long 
DNA string. For example, one probe may be located from the first to the tenth 
location on the long DNA strand and be the following: 

CATCGATCTC 

Another may be located between the eighth and 20th and be the following: 
CTCGGGAGGGATC 



and so on. 

Each of these probes can be represented as a tuple of sets
$(x_A, x_C, x_G, x_T)$, where $x_A$, $x_C$, $x_G$, and $x_T$ are the positions where A, C, G,
and T occur, respectively. Hence the first probe can be represented by:

({2,6}, {1,4,8,10}, {5}, {3,7,9}) 

while the second probe can be represented by: 

({14,18}, {8,10,20}, {11,12,13,15,16,17}, {9,19}) 
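
The encoding of a probe as four position sets can be sketched in a few lines. This is an illustration only; the function name `encode_probe` and its `start` parameter (the 1-based position of the first letter on the long strand) are assumptions introduced here.

```python
def encode_probe(fragment, start):
    """Return (x_A, x_C, x_G, x_T): the sets of positions at which each letter occurs."""
    positions = {letter: set() for letter in "ACGT"}
    for offset, letter in enumerate(fragment):
        positions[letter].add(start + offset)
    return tuple(positions[letter] for letter in "ACGT")


first_probe = encode_probe("CATCGATCTC", 1)       # located at positions 1 to 10
second_probe = encode_probe("CTCGGGAGGGATC", 8)   # located at positions 8 to 20
```

Applied to the two fragments above, the function reproduces the two tuples of sets given in the text.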

Suppose we have a large number of probe data in a relation
$Probe(x_A, x_C, x_G, x_T)$, and we would like to reconstruct the long DNA strand
using the probe data. There may be some errors in the sequencing information,
for example, a letter A on the DNA may be incorrectly indicated as C in the
probe data. Suppose we are satisfied with a 95 percent accuracy regarding the
probe data. Suppose also that we know that the long DNA sequence contains
between 5000 and 6000 letters. The following relational calculus query finds
possible long DNA sequences $(y_A, y_C, y_G, y_T)$ that contain each probe with at
least a 95 percent accuracy.

\[
\begin{array}{l}
(|y_A \cup y_C \cup y_G \cup y_T| - |y_A| - |y_C| - |y_G| - |y_T| = 0) \;\wedge \\
(|y_A \cup y_C \cup y_G \cup y_T| \ge 5000) \;\wedge \\
(|y_A \cup y_C \cup y_G \cup y_T| \le 6000) \;\wedge \\
(\forall x_A, x_C, x_G, x_T \; (Probe(x_A, x_C, x_G, x_T) \rightarrow \\
\quad |x_A \cap y_A| + |x_C \cap y_C| + |x_G \cap y_G| + |x_T \cap y_T| \ge 0.95\, |x_A \cup x_C \cup x_G \cup x_T|))
\end{array}
\]

In the above the solution is $(y_A, y_C, y_G, y_T)$, which we can get by eliminating
the universally quantified variables $x_A, x_C, x_G, x_T$. The first line ensures that
the four sets $y_A, y_C, y_G, y_T$ are disjoint. If they are disjoint, then the length of
the solution is $|y_A \cup y_C \cup y_G \cup y_T|$, which is restricted to be between 5000 and
6000 letters in the second and third lines of the relational calculus query. The
fourth and fifth lines express that for each probe its overlap with the solution
(left hand side of the cardinality constraint) must be greater than or equal to 95
percent of the length of the probe (right hand side).
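
To make the five lines of the query concrete, the sketch below checks them directly for one given candidate $(y_A, y_C, y_G, y_T)$ and a finite list of probes. It is not the quantifier elimination of Theorem 3, only a brute-force test of a single candidate; the function name `satisfies_query` and its parameters are hypothetical, and the 95 percent bound is written with integers to avoid floating-point rounding.

```python
def satisfies_query(y, probes, min_len=5000, max_len=6000, percent=95):
    """Check the cardinality constraints of the query for one candidate y = (y_A, y_C, y_G, y_T)."""
    y_union = set().union(*y)
    # Line 1: the four sets are disjoint iff the union is as large as the sum of the parts.
    if len(y_union) - sum(len(part) for part in y) != 0:
        return False
    # Lines 2 and 3: the length of the solution lies within the given bounds.
    if not (min_len <= len(y_union) <= max_len):
        return False
    # Lines 4 and 5: every probe overlaps the solution in at least `percent` of its letters.
    for x in probes:
        overlap = sum(len(x_part & y_part) for x_part, y_part in zip(x, y))
        probe_len = len(set().union(*x))
        if 100 * overlap < percent * probe_len:
            return False
    return True


# Hypothetical toy data with small bounds so the example stays readable.
candidate = ({2, 6}, {1, 4, 8, 10}, {5}, {3, 7, 9})
exact_probe = candidate
wrong_probe = ({6}, {1, 2, 4, 8, 10}, {5}, {3, 7, 9})   # position 2 misread as C
```

With a probe of length 10, a single wrong letter already gives 90 percent overlap and is rejected, whereas one wrong letter in a probe of length 20 would still meet the bound.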

We also need to check that the solution is a continuous sequence, that is,
there are no gaps. We can do that by defining an input relation $Order(x, z)$,
which will contain all tuples of the form $(\{i\}, \{j\})$ such that $1 \le i \le j \le 6000$
and $i, j \in \mathcal{N}$. Then the following relational calculus formula tests whether the
solution is a continuous sequence:

\[ \exists z \forall x \; Order(x, z) \leftrightarrow |x \cap (y_A \cup y_C \cup y_G \cup y_T)| = 1 \]

The formula says that there must be a last element $z$ in the solution, such that
any $x$ is an element of the solution, if and only if it is less than or equal to $z$. □

5 Conclusions and Future Work 

Quantifier elimination from Boolean linear cardinality constraint formulas by 
reduction to quantifier elimination in Presburger arithmetic is a new approach. 
The complexity of the quantifier elimination needs to be investigated. Especially 
the handling of (atomic) constants may be simplified. 

It is also interesting to look at more complex cardinality constraints. For 
example, one cannot express using only linear cardinality constraints that the 
cardinality of set A is always the square of the cardinality of set B. We avoided 
such constraints, because even existential quantifier elimination from integer 
polynomial equations is unsolvable in general [34]. However, with restrictions on 
the number of variables we may have an interesting solvable problem. 

Example 10 shows that relational calculus with Boolean linear cardinality
constraints can handle some string problems. It is interesting to compare this in
expressive power and computational complexity with query languages for string
databases. Grahne et al. [22] proposed an extension of relational calculus with
alignment operators for string databases, but the evaluation of their query
language is unsolvable in general [22]. Benedikt et al. [5] proposed several other
extensions of relational calculus with various string operators. They show that
the language $S_{len}$ with only the prefix, the concatenation of a single character of
the alphabet (at the end of a string), and the test-of-equal-length operators on
strings does not admit quantifier elimination, although some weaker logics have
quantifier elimination.

Finally, there are many practical implementation questions, for example, 
defining algebraic operators for query evaluation, data structures for represent- 
ing the Boolean linear cardinality constraints, indexing for fast retrieval, and 
other issues of query optimization. 
Abstract. Update propagation in deductive databases can be implemented
by combining rule rewriting and fixpoint computation, analogous
to the way query answering is performed via Magic Sets. For
efficiency reasons, bottom-up propagation rules have to be subject to
Magic rewriting, thus possibly losing stratifiability. We propose to use
the soft stratification approach for computing the well-founded model of
the magic propagation rules (guaranteed to be two-valued) because of
the simplicity and efficiency of this technique.



1 Introduction 

In the field of deductive databases, a considerable amount of research has been 
devoted to the efficient computation of induced changes by means of update 
propagation (e.g. [5,6,10,12,17]). Update propagation has been mainly studied 
in order to provide methods for efficient incremental view maintenance and in- 
tegrity checking in stratifiable databases. The results in this area are particu- 
larly relevant for systems which will implement the new SQL:1999 standard and 
hence will allow the definition of recursive views. In addition, update propagation 
methods based on bottom-up materialization seem to be particularly well suited 
for updating distributed databases or in the context of WWW applications for 
signaling changes of data sources to mediators. 

The aim of update propagation is the computation of implicit changes of 
derived relations resulting from explicitly performed updates of the extensional 
fact base. As in most cases an update will affect only a small portion of the 
database, it is rarely reasonable to compute the induced changes by compar- 
ing the entire old and new database state. Instead, the implicit modifications 
should be iteratively computed by propagating the individual updates through 
the possibly affected rules and computing their consequences. Although most 
approaches essentially apply the same propagation techniques, they mainly dif- 
fer in the way they are implemented and in the granularity of the computed 
induced updates. We will consider the smallest granularity of updates, so called 
true updates, only, in order to pose no restrictions to the range of applications 
of update propagation. Moreover, we will use deductive rules for an incremental 
description of induced updates. 
Incremental methods for update propagation can be divided into bottom-up
and top-down approaches. In the context of pure bottom-up materialization, the
benefit of incremental propagation rules is that the evaluation of their rule bodies
can be restricted to the values of the currently propagated update such that
the entire propagation process is very naturally limited to the actually affected
derived relations. However, similar bottom-up approaches require the simulated
state of derived relations to be materialized completely in order to determine true
updates. By contrast, if update propagation were based on a pure top-down
approach, as proposed in [10,17], the simulation of the opposite state could be
easily restricted to the relevant part by querying the relevant portion of the
database. The disadvantage is that the induced changes can only be determined
by querying all existing derived relations, although most of them will probably
not be affected by the update. Additionally, a more elaborate control is needed
in order to implement the propagation of base updates to recursive views.

Therefore, we propose to combine the advantages of top-down and bottom-up
propagation. To this end, known transformation-based methods for update
propagation in stratifiable databases (e.g. [6,17]) are extended by incorporating
the Magic Sets rewriting for simulating top-down query evaluation. The
resulting approach can be improved by other relational optimization techniques,
handles non-linear recursion and may also propagate updates at arbitrary
granularity. We show that the transformed rules can be efficiently evaluated by the
soft stratification approach [4], solving stratification problems occurring in other
bottom-up methods. This simple set-oriented fixpoint process is well-suited for
being transferred into the SQL context and its (partial) materialization avoids
expensive recomputations occurring in related transformation-based approaches
(e.g. [5,17]).

1.1 Related Work 

Methods for update propagation have been mainly studied in the context of Da- 
talog, relational algebra, and SQL. Methods in Datalog based on SLDNF resolu- 
tion cannot guarantee termination for recursively defined predicates (e.g. [17]). 
In addition, a set-oriented evaluation technique is preferred in the database con- 
text. Bottom-up methods either provide no goal-directed rule evaluation with 
respect to induced updates (e.g. [10]) or suffer from stratification problems (cf. 
Section 2) arising when transforming an original stratifiable schema (e.g. [6,12]). 
Hence, for the latter approaches (for an overview cf. [6]) the costly application
of more general evaluation techniques like the alternating fixpoint [20] is needed. 

In general, approaches formulated in relational algebra or SQL are not capa- 
ble of handling (non-linear) recursion, the latter usually based on transformed 
views or specialized triggers. Transformed SQL- views directly correspond to our 
proposed method in the non-recursive case. The application of triggers (e.g. pro- 
duction rules even for recursive relations in [5]), however, does not allow the 
reuse of intermediate results obtained by querying the derivability and
effectiveness tests. In [7] an algebraic approach to view maintenance is presented
which is capable of handling duplicates but cannot be applied to general
recursive views. For recursive views, [9] proposes the "Delete and Rederive" method
which avoids the costly test of alternative derivations when computing induced 
deletions. However, this approach needs to compute overestimations of the tu- 
ples to be deleted and additional pretests are necessary to check whether a view 
is affected by a given update [11]. 

The importance of integrating Magic Sets with traditional relational opti- 
mizations has been discussed already in [15]. The structured propagation method 
in [6] represents a bottom-up approach for computing Magic Sets transformed 
propagation rules. However, as these rules are potentially unstratifiable, this 
approach is based on the alternating fixpoint computation [20] leading to an 
inefficient evaluation because the specific reason for unstratifiability is not taken 
into account. Therefore, we propose a less complex magic updates transforma- 
tion resulting in a set of rules which is not only smaller but may in addition be 
efficiently evaluated using the soft stratification approach. Thus, fewer joins have
to be performed and fewer facts are generated.

2 Basic Concepts 

We consider a first order language with a universe of constants $\mathcal{U} = \{a, b, c, \ldots\}$,
a set of variables $\{X, Y, Z, \ldots\}$ and a set of predicate symbols $\{p, q, r, \ldots\}$. A term
is a variable or a constant (i.e., we restrict ourselves to function-free terms). Let
$p$ be an $n$-ary predicate symbol and $t_i$ ($i = 1, \ldots, n$ and $n \ge 0$) terms; then
$p(t_1, \ldots, t_n)$ (or simply $p(\vec{t})$) is called an atom. An atom is ground if every $t_i$ is
a constant. If $A$ is an atom, we use $pred(A)$ to refer to the predicate symbol of
$A$. A fact is a clause of the form $p(t_1, \ldots, t_n) \leftarrow true$ where $p(t_1, \ldots, t_n)$ is a
ground atom. A literal is either an atom or a negated atom. A rule is a clause of
the form $p(t_1, \ldots, t_n) \leftarrow L_1 \wedge \ldots \wedge L_m$ with $n \ge 0$ and $m \ge 1$ where $p(t_1, \ldots, t_n)$
is an atom denoting the rule's head, and $L_1, \ldots, L_m$ are literals representing its
body. We assume all deductive rules to be safe, i.e., all variables occurring in the
head or in any negated literal of a rule must be also present in a positive literal
in its body. If $A$ is the head of a given rule $R$, we use $pred(R)$ to refer to the
predicate symbol of $A$. For a set of rules $\mathcal{R}$, $pred(\mathcal{R})$ is defined as $\bigcup_{r \in \mathcal{R}} pred(r)$.
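
The safety condition stated above lends itself to a mechanical check. The following is a minimal sketch, assuming a hypothetical representation in which literals and rules are small dataclasses and a term counts as a variable exactly when it is a string starting with an uppercase letter; none of these names come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    pred: str
    args: tuple           # terms: uppercase-initial strings are variables, anything else a constant
    positive: bool = True


@dataclass(frozen=True)
class Rule:
    head: Literal
    body: tuple           # non-empty tuple of Literal


def variables(literal):
    """Return the set of variables occurring in a literal."""
    return {t for t in literal.args if isinstance(t, str) and t[:1].isupper()}


def is_safe(rule):
    """A rule is safe if every variable of the head or of a negated body literal
    also occurs in some positive body literal."""
    covered = set().union(*(variables(l) for l in rule.body if l.positive))
    needed = variables(rule.head).union(*(variables(l) for l in rule.body if not l.positive))
    return needed.issubset(covered)


one_way = Rule(Literal("one_way", ("X",)),
               (Literal("path", ("X", "Y")), Literal("path", ("Y", "X"), positive=False)))
unsafe = Rule(Literal("q", ("X",)), (Literal("r", ("X",), positive=False),))
```

Here `one_way` is safe because both `X` and `Y` occur in the positive literal, while `unsafe` is rejected because `X` occurs only under negation.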



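Definition 1. A deductive database $\mathcal{D}$ is a tuple $\langle \mathcal{F}, \mathcal{R} \rangle$ where $\mathcal{F}$ is a finite
set of facts and $\mathcal{R}$ a finite set of rules such that $pred(\mathcal{F}) \cap pred(\mathcal{R}) = \emptyset$.
Within a deductive database $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$, a predicate symbol $p$ is called derived
(view predicate), if $p \in pred(\mathcal{R})$. The predicate $p$ is called extensional (or base
predicate), if $p \in pred(\mathcal{F})$.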

For simplicity of exposition, and without loss of generality, we assume that a
predicate is either base or derived, but not both, and that constants occur
neither in rule heads nor in body literals referring to a derived relation. Both
conditions can be easily achieved by rewriting a given database. Before defining the
semantics of a deductive database, we briefly introduce the notions stratification
and soft stratification for partitioning a given deductive rule set.



Update Propagation in Deductive Databases Using Soft Stratification 






A stratification $\lambda$ on $\mathcal{D}$ is a mapping from the set of all predicate symbols
$Rel_{\mathcal{D}}$ in $\mathcal{D}$ to the set of positive integers $\mathbb{N}$ inducing a partition of the given
rule set such that all positive derivations of relations can be determined before
a negative literal with respect to one of those relations is evaluated (cf. [1]).
For every partition $\mathcal{P} = P_1 \cup \ldots \cup P_n$ induced by a stratification the condition
$pred(P_i) \cap pred(P_j) = \emptyset$ with $i \ne j$ must necessarily hold.

In contrast to this, a soft stratification $\lambda^s$ on $\mathcal{D}$ is a mapping from the set of
all rules in $\mathcal{D}$ to the set of positive integers $\mathbb{N}$ inducing a partition of the given
rule set for which the condition above does not necessarily hold (cf. [4]). A soft
stratification is solely defined for Magic Sets transformed rule sets (or Magic
Updates rewritten ones as shown later on) which may be even unstratifiable.

Given a deductive database $\mathcal{D}$, the Herbrand base $\mathcal{H}_{\mathcal{D}}$ of $\mathcal{D}$ is the set of all
ground atoms that can be constructed from the predicate symbols and constants
occurring in $\mathcal{D}$. Any subset $I$ of $\mathcal{H}_{\mathcal{D}}$ is a Herbrand interpretation of $\mathcal{D}$. Given a
Herbrand interpretation $I$, its complement set with respect to the Herbrand base,
i.e. $\mathcal{H}_{\mathcal{D}} \setminus I$, is denoted $\overline{I}$ while $\neg \cdot I$ represents the set that includes all atoms
in $I$ in negated form. Based on these notions, we define the soft consequence
operator [4] which serves as the basic operator for determining the semantics of
stratifiable or softly stratifiable deductive databases.

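Definition 2. Let $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$ be a deductive database and $\mathcal{P} = P_1 \cup \ldots \cup P_n$
a partition of $\mathcal{R}$. The soft consequence operator $T^s_{\mathcal{P}}$ is a mapping on sets of
ground atoms and is defined for $X \subseteq \mathcal{H}_{\mathcal{D}}$ as follows:

\[
T^s_{\mathcal{P}}(X) :=
\begin{cases}
X & \text{if } T_{P_j}(X) = X \text{ for all } j \in \{1, \ldots, n\}, \\
T_{P_i}(X) \text{ with } i = \min\{ j \mid T_{P_j}(X) \supsetneq X \} & \text{otherwise,}
\end{cases}
\]

where $T_{\mathcal{R}}$ denotes the immediate consequence operator by van Emden/Kowalski.

As the soft consequence operator $T^s_{\mathcal{P}}$ is monotonic for stratifiable or softly
stratifiable databases, its least fixpoint $lfp(T^s_{\mathcal{P}}, \mathcal{F})$ exists, where $lfp(T^s_{\mathcal{P}}, \mathcal{F})$ denotes
the least fixpoint of operator $T^s_{\mathcal{P}}$ containing $\mathcal{F}$ with respect to a stratified or softly
stratified partition of the rules in $\mathcal{D}$. Given an arbitrary deductive database $\mathcal{D}$,
its semantics is defined by its well-founded model $\mathcal{M}_{\mathcal{D}}$ which is known to be
two-valued for stratifiable or softly stratifiable databases.

Lemma 1. Let $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$ be a (softly) stratifiable deductive database and
$\lambda$ a (soft) stratification of $\mathcal{R}$ inducing the partition $\mathcal{P}$ of $\mathcal{R}$. The well-founded
model $\mathcal{M}_{\mathcal{D}}$ of $\langle \mathcal{F}, \mathcal{R} \rangle$ is identical with the least fixpoint model of $T^s_{\mathcal{P}}$, i.e.,

\[ \mathcal{M}_{\mathcal{D}} = lfp(T^s_{\mathcal{P}}, \mathcal{F}) \cup \neg \cdot \overline{lfp(T^s_{\mathcal{P}}, \mathcal{F})}. \]

Proof. cf. [4].

For illustrating the notations introduced above, consider the following example
of a stratifiable deductive database $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$:
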



$\mathcal{R}$: one_way(X) ← path(X, Y) ∧ ¬path(Y, X)
    path(X, Y) ← edge(X, Y)
    path(X, Y) ← edge(X, Z) ∧ path(Z, Y)

$\mathcal{F}$: edge(1, 2)
    edge(2, 1)
    edge(2, 3)
Relation path represents the transitive closure of relation edge while relation
one_way selects all path(X, Y)-facts where Y is reachable from X but not vice
versa. A stratification induces (in this case) the unique partition $\mathcal{P} = P_1 \cup P_2$
with $P_1$ comprising the two path-rules while $P_2$ includes the one_way-rule. The
computation of $lfp(T^s_{\mathcal{P}}, \mathcal{F})$ then induces the following sequence of sets

$F_1 := \mathcal{F}$
$F_2 := T_{P_1}(F_1) = \{path(1,2), path(2,1), path(2,3)\} \cup F_1$
$F_3 := T_{P_1}(F_2) = \{path(1,1), path(2,2), path(1,3)\} \cup F_2$
$F_4 := T_{P_2}(F_3) = \{one\_way(1), one\_way(2)\} \cup F_3$
$F_5 := F_4$.

The fixpoint $F_5$ coincides with the positive portion of the well-founded model
$\mathcal{M}_{\mathcal{D}}$ of $\mathcal{D}$, i.e. $\mathcal{M}_{\mathcal{D}} = F_5 \cup \neg \cdot \overline{F_5}$.
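
The iteration of the soft consequence operator on this example can be replayed with a short program. The sketch below is an assumption-laden illustration, not the implementation of [4]: each stratum is hard-coded as a Python function that returns the input facts together with their immediate consequences, and facts are plain tuples such as `("edge", 1, 2)`.

```python
def t_path(facts):
    """Stratum P_1: the input facts plus the immediate consequences of the two path rules."""
    edges = {f[1:] for f in facts if f[0] == "edge"}
    paths = {f[1:] for f in facts if f[0] == "path"}
    derived = {("path", x, y) for (x, y) in edges}
    derived |= {("path", x, y) for (x, z) in edges for (z2, y) in paths if z == z2}
    return facts | derived


def t_one_way(facts):
    """Stratum P_2: the input facts plus the immediate consequences of the one_way rule."""
    paths = {f[1:] for f in facts if f[0] == "path"}
    return facts | {("one_way", x) for (x, y) in paths if (y, x) not in paths}


def soft_consequence(partition, facts):
    """Apply the first stratum that still derives something new; otherwise return facts."""
    for stratum in partition:
        result = stratum(facts)
        if result != facts:
            return result
    return facts


def lfp(partition, facts):
    """Iterate the soft consequence operator and return the sequence F_1, F_2, ..."""
    trace = [facts]
    while True:
        nxt = soft_consequence(partition, trace[-1])
        if nxt == trace[-1]:
            return trace
        trace.append(nxt)


F = {("edge", 1, 2), ("edge", 2, 1), ("edge", 2, 3)}
trace = lfp([t_path, t_one_way], F)
```

Because the one_way stratum is only consulted once the path stratum is saturated, the negative literal is evaluated against the complete path relation, which is the point of ordering the partition.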

3 Update Propagation 

We refrain from presenting a concrete update language but rather concentrate 
on the resulting sets of update primitives specifying insertions and deletions of 
individual facts. In principle every set oriented update language can be used that 
allows the specification of modifications of this kind. We will use the notion Base 
Update to denote the ’true’ changes caused by a transaction; that is, we restrict 
the set of facts to be updated to the minimal set of updates where compensation 
effects (given by an insertion and deletion of the same fact or the insertion of 
facts which already exist in the database) are already considered. 

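Definition 3. Let $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$ be a stratifiable database. A base update $u_{\mathcal{D}}$ is
a pair $\langle u^+_{\mathcal{D}}, u^-_{\mathcal{D}} \rangle$ where $u^+_{\mathcal{D}}$ and $u^-_{\mathcal{D}}$ are sets of base facts with
$pred(u^+_{\mathcal{D}} \cup u^-_{\mathcal{D}}) \subseteq pred(\mathcal{F})$, $u^+_{\mathcal{D}} \cap u^-_{\mathcal{D}} = \emptyset$, $u^+_{\mathcal{D}} \cap \mathcal{F} = \emptyset$ and $u^-_{\mathcal{D}} \subseteq \mathcal{F}$.
The atoms $u^+_{\mathcal{D}}$ represent facts to be inserted into $\mathcal{D}$, whereas $u^-_{\mathcal{D}}$ contains the
facts to be deleted from $\mathcal{D}$.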
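
The four conditions on a base update can be checked directly on sets of facts. The following sketch assumes facts are tuples whose first component is the predicate symbol; the function names `is_base_update` and `apply_update` are hypothetical.

```python
def is_base_update(facts, inserts, deletes):
    """Check the conditions of a base update (inserts, deletes) against the fact base."""
    base_preds = {f[0] for f in facts}                       # pred(F)
    preds_ok = {f[0] for f in inserts | deletes}.issubset(base_preds)
    disjoint = not (inserts & deletes)                       # no fact is both inserted and deleted
    really_new = not (inserts & facts)                       # inserted facts are not yet present
    really_old = deletes.issubset(facts)                     # deleted facts are present
    return preds_ok and disjoint and really_new and really_old


def apply_update(facts, inserts, deletes):
    """Return the new fact base after performing the update."""
    return (facts - deletes) | inserts


F = {("edge", 1, 2), ("edge", 2, 1), ("edge", 2, 3)}
u_plus = {("edge", 3, 1)}      # hypothetical insertion
u_minus = {("edge", 2, 3)}     # hypothetical deletion
```

The conditions rule out compensating effects, so every element of the update corresponds to a real change of the fact base.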

We will use the notion induced update to refer to the entire set of facts in which 
the new state of the database differs from the former after an update of base 
tables has been applied. 

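Definition 4. Let $\mathcal{D}$ be a stratifiable database, $\mathcal{M}_{\mathcal{D}}$ the semantics of $\mathcal{D}$ and $u_{\mathcal{D}}$
an update. Then $u_{\mathcal{D}}$ leads to an induced update $u_{\mathcal{D} \rightarrow \mathcal{D}'}$ from $\mathcal{D}$ to $\mathcal{D}'$ which is
a pair $\langle u^+_{\mathcal{D} \rightarrow \mathcal{D}'}, u^-_{\mathcal{D} \rightarrow \mathcal{D}'} \rangle$ of sets of ground atoms such that $u^+_{\mathcal{D} \rightarrow \mathcal{D}'} = \mathcal{M}_{\mathcal{D}'} \setminus \mathcal{M}_{\mathcal{D}}$
and $u^-_{\mathcal{D} \rightarrow \mathcal{D}'} = \mathcal{M}_{\mathcal{D}} \setminus \mathcal{M}_{\mathcal{D}'}$. The atoms $u^+_{\mathcal{D} \rightarrow \mathcal{D}'}$ represent the induced insertions,
whereas $u^-_{\mathcal{D} \rightarrow \mathcal{D}'}$ consists of the induced deletions.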

The task of update propagation is to provide a description of the overall occurred
modifications in $u_{\mathcal{D} \rightarrow \mathcal{D}'}$. Technically, such a description is given by a set of
delta facts for any affected relation which may be stored in corresponding delta
relations. For each predicate symbol $p \in pred(\mathcal{D})$, we will use a pair of delta
relations $\langle \Delta^+ p, \Delta^- p \rangle$ representing the insertions and deletions induced on $p$ by
an update $u_{\mathcal{D}}$. In the sequel, a literal $L$ which references a delta relation is called
delta literal. In order to abstract from negative and positive occurrences of atoms
in rule bodies, we use the superscripts $^+$ and $^-$ for indicating what kind of
delta relation is to be used. For a positive literal $A = p(t_1, \ldots, t_n)$ we define
$A^+ := \Delta^+ p(t_1, \ldots, t_n)$ and $A^- := \Delta^- p(t_1, \ldots, t_n)$. For a negative literal $L = \neg A$,
we use $L^+ := A^-$ and $L^- := A^+$.

In the following, we develop transition rules and propagation rules for defining 
such delta relations. First, quite similar to query seeds used in the Magic Sets 
method, we generate a set of delta facts called propagation seeds. 

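Definition 5. Let $\mathcal{D}$ be a stratifiable deductive database and $u_{\mathcal{D}} = \langle u^+_{\mathcal{D}}, u^-_{\mathcal{D}} \rangle$ a
base update. The set of propagation seeds $prop\_seeds(u_{\mathcal{D}})$ with respect to $u_{\mathcal{D}}$ is
defined as follows:

\[ prop\_seeds(u_{\mathcal{D}}) := \{ \Delta^{\pi} p(c_1, \ldots, c_n) \mid p(c_1, \ldots, c_n) \in u^{\pi}_{\mathcal{D}} \text{ and } \pi \in \{+, -\} \}. \]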
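
Generating the propagation seeds amounts to renaming the predicate of each updated base fact into the matching delta relation. A minimal sketch follows; the naming scheme `delta+_p` and `delta-_p` for the delta relations and the tuple representation of facts are assumptions made for this illustration.

```python
def prop_seeds(inserts, deletes):
    """Turn each inserted or deleted base fact p(c_1, ..., c_n) into a delta fact."""
    seeds = set()
    for sign, atoms in (("+", inserts), ("-", deletes)):
        for pred, *constants in atoms:
            seeds.add((f"delta{sign}_{pred}", *constants))
    return seeds


seeds = prop_seeds({("edge", 3, 1)}, {("edge", 2, 3)})   # hypothetical base update
```

The resulting delta facts are added to the fact base, so that rules referring to delta relations can start from them.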

Propagation seeds form the starting point from which induced updates, repre- 
sented by derived delta relations, are computed. An update propagation method 
can only be efficient if most derived facts eventually rely on at least one fact in 
an extensional delta relation. 

Generally, for computing true updates references to both the old and new 
database state are necessary. We will now investigate the possibility of dropping 
the explicit references to one of the states by deriving it from the other one and 
the given updates. The benefit of such a state simulation is that the database 
system is not required to store both states explicitly but may work on one 
state only. The rules defining the simulated state will be called transition rules 
according to the naming in [17]. 

Although both directions are possible, we will concentrate on a somewhat
pessimistic approach, the simulation of the new state while the old one is actually
given. The following discussion, however, can be easily transferred to the case of
simulating the old state [6]. In principle, transition rules can be differentiated by
the extent to which induced updates are considered for simulating the other database
state. We solely use so-called naive transition rules which derive the new state
from the physically present old fact base and the explicitly given updates. The 
disadvantage of these transition rules is that each derivation with respect to the 
new state has to go back to the extensional delta relations and hence makes no 
use of the implicit updates already derived during the course of propagation. In 
the Internal Events Method [17] as well as in [12] it has been proposed to improve 
state simulation by employing not only the extensional delta relations but the 
derived ones as well. However, the union of the original, the propagation and this 
kind of transition rules is not always stratifiable, and may even not represent 
the true induced update anymore under the well-founded semantics [6]. 

Assuming that the base updates are not yet physically performed on the
database, it follows from Definition 4 that the new state can be computed from
the old one and the true induced update $u_{\mathcal{D} \rightarrow \mathcal{D}'} = \langle u^+_{\mathcal{D} \rightarrow \mathcal{D}'}, u^-_{\mathcal{D} \rightarrow \mathcal{D}'} \rangle$:

\[ \mathcal{M}_{\mathcal{D}'} = (\mathcal{M}_{\mathcal{D}} \setminus u^-_{\mathcal{D} \rightarrow \mathcal{D}'}) \cup u^+_{\mathcal{D} \rightarrow \mathcal{D}'}. \]

We will use the mapping new for referring to the new database state, which
syntactically transforms the predicate symbols of the literals it is applied to.
Using the equation above directly leads to an equivalence on the level of tuples

\[ \mathit{new}\ A \Leftrightarrow (A \wedge \neg(A^-)) \vee A^+ \]

which holds if the referenced delta relations correctly describe the induced update
$u_{\mathcal{D} \rightarrow \mathcal{D}'}$. Note that we assume the precedence of the superscripts $^+$ and $^-$
to be higher than the one of $\neg$. Thus, we can omit the brackets in $\neg(A^-)$ and
simply write $\neg A^-$. Using Definition 5 and the equivalence above, the deductive
rules for inferring the new state of extensional relations can be easily derived.
For instance, for the extensional relation edge of our running example the new 
state is specified by the rules (in the sequel, all relation names are abbreviated) 

$\mathit{new}\ e(X, Y) \leftarrow e(X, Y) \wedge \neg \Delta^- e(X, Y)$
$\mathit{new}\ e(X, Y) \leftarrow \Delta^+ e(X, Y)$.

From the new states of extensional relations we can successively infer the new 
states of derived relations using the dependencies given by the original rule set. 
To this end, the original rules are duplicated and the new mapping is applied to 
all predicate symbols occurring in the new rules. For instance, the rules 

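$\mathit{new}\ o(X) \leftarrow \mathit{new}\ p(X, Y) \wedge \neg \mathit{new}\ p(Y, X)$

$\mathit{new}\ p(X, Y) \leftarrow \mathit{new}\ e(X, Y)$

$\mathit{new}\ p(X, Y) \leftarrow \mathit{new}\ e(X, Z) \wedge \mathit{new}\ p(Z, Y)$

specify the new state of the relations path and one_way. Note that the application
of $\neg$ and the mapping new is orthogonal, i.e. $\mathit{new}\ \neg A \equiv \neg \mathit{new}\ A$, such that the
literal $\mathit{new}\ \neg p(Y, X)$ from the example above may be replaced by $\neg \mathit{new}\ p(Y, X)$.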
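
The effect of these transition rules on the running example can be simulated with ordinary set operations. The sketch below is illustrative only: the particular base update (inserting edge(3,1), deleting edge(2,3)) is a hypothetical choice, and the derived relations are recomputed from scratch on the simulated edge relation instead of being evaluated by a rule engine.

```python
def new_state(old, delta_plus, delta_minus):
    """Direct transition rules: new e holds if e holds and is not deleted, or if e is inserted."""
    return (old - delta_minus) | delta_plus


def transitive_closure(edges):
    """Evaluate the two path rules on a given edge relation."""
    paths = set(edges)
    while True:
        extra = {(x, y) for (x, z) in edges for (z2, y) in paths if z == z2} - paths
        if not extra:
            return paths
        paths |= extra


edge = {(1, 2), (2, 1), (2, 3)}
new_edge = new_state(edge, delta_plus={(3, 1)}, delta_minus={(2, 3)})   # hypothetical update
new_path = transitive_closure(new_edge)
new_one_way = {x for (x, y) in new_path if (y, x) not in new_path}
```

Every derivation in the simulated state goes back to the base relation and the extensional deltas, which is the behaviour the text attributes to naive transition rules.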

Definition 6. Let $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$ be a stratifiable deductive database. Then the set
of naive transition rules for true updates and new state simulation with respect
to $\mathcal{R}$ is denoted $\tau(\mathcal{R})$ and is defined as the smallest set satisfying the following
conditions:

1. For each $n$-ary extensional predicate symbol $p \in pred(\mathcal{F})$, the direct
transition rules

$\mathit{new}\ A \leftarrow A \wedge \neg A^-$ and $\mathit{new}\ A \leftarrow A^+$

are in $\tau(\mathcal{R})$ where $A = p(x_1, \ldots, x_n)$, and the $x_i$ are distinct variables.

2. For each rule $A \leftarrow L_1 \wedge \ldots \wedge L_n \in \mathcal{R}$, an indirect transition rule of the form

$\mathit{new}\ A \leftarrow \mathit{new}\ (L_1 \wedge \ldots \wedge L_n)$

is in $\tau(\mathcal{R})$.

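It is obvious that if $\mathcal{R}$ is stratifiable, the rule set $\mathcal{R} \cup \tau(\mathcal{R})$ must be stratifiable as
well. The following proposition shows that if a stratifiable database $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$
is augmented with the naive transition rules $\tau(\mathcal{R})$ as well as the propagation
seeds $prop\_seeds(u_{\mathcal{D}})$ with respect to a base update $u_{\mathcal{D}}$, then $\tau(\mathcal{R})$ correctly
defines the new database state.

Proposition 1. Let $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$ be a stratifiable database, $u_{\mathcal{D}}$ an update and
$u_{\mathcal{D} \rightarrow \mathcal{D}'} = \langle u^+_{\mathcal{D} \rightarrow \mathcal{D}'}, u^-_{\mathcal{D} \rightarrow \mathcal{D}'} \rangle$ the corresponding induced update from $\mathcal{D}$ to $\mathcal{D}'$. Let
$\mathcal{D}^a = \langle \mathcal{F} \cup prop\_seeds(u_{\mathcal{D}}), \mathcal{R} \cup \tau(\mathcal{R}) \rangle$ be the augmented deductive database
of $\mathcal{D}$. Then $\mathcal{D}^a$ correctly represents the implicit state of $\mathcal{D}'$, i.e. for all atoms
$A \in \mathcal{H}_{\mathcal{D}'}$ holds $A \in \mathcal{M}_{\mathcal{D}'} \Longleftrightarrow \mathit{new}\ A \in \mathcal{M}_{\mathcal{D}^a}$.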
The proof of this proposition is omitted as it directly follows from Definition 5
and from the fact that the remaining transition rules are a copy of those in
$\mathcal{R}$ with the predicate symbols correspondingly replaced. We will now introduce
incremental propagation rules for true updates. Basically, an induced insertion
or deletion can be represented by the difference between the two consecutive
database states. However, for efficiency reasons we allow delta relations to be
referenced in the body of propagation rules as well:

Definition 7. Let $\mathcal{R}$ be a stratifiable deductive rule set. The set of propagation
rules for true updates with respect to $\mathcal{R}$, denoted $\varphi(\mathcal{R})$, is defined as the smallest
set satisfying the condition:

For each rule $A \leftarrow L_1 \wedge \ldots \wedge L_n \in \mathcal{R}$ and each body literal $L_i$ ($i = 1, \ldots, n$) two
propagation rules of the form

$A^{+} \leftarrow L_i^{+} \wedge new\,(L_1 \wedge \ldots \wedge L_{i-1} \wedge L_{i+1} \wedge \ldots \wedge L_n) \wedge \neg A$
$A^{-} \leftarrow L_i^{-} \wedge (L_1 \wedge \ldots \wedge L_{i-1} \wedge L_{i+1} \wedge \ldots \wedge L_n) \wedge new\,\neg A$

are in $\varphi(\mathcal{R})$. The literals $new\ L_j$ and $L_j$ ($j = 1, \ldots, i-1, i+1, \ldots, n$) are called
side literals of $L_i$.

The propagation rules perform a comparison of the old and new database states
while providing a focus on individual updates by applying the delta literals
$L_i^{\pi}$ with $\pi \in \{+, -\}$. Each propagation rule body may be divided into the
derivability test and the effectiveness test. The derivability test
($L_i^{\pi} \wedge \{new\}(L_1 \wedge \ldots \wedge L_{i-1} \wedge L_{i+1} \wedge \ldots \wedge L_n)$) checks whether A is derivable in the
new or the old state, respectively. The effectiveness test (called derivability test in [21]
and redundancy test in [10]) ($\{new\}(\neg A)$) checks whether the fact obtained by
the derivability test is not derivable in the opposite state. In general, this test
cannot be further specialized as it checks for alternative derivations caused by
other rules defining pred(A).

The obtained propagation rules and seeds as well as transition rules can 
be added to the original database yielding a safe and stratifiable database. The 
safeness of propagation rules immediately follows from the safeness of the original 
rules. Furthermore, the propagation rules cannot jeopardize stratifiability, as 
delta relations are always positively referenced and thus cannot participate in 
any negative cycle. Consider again the rules from Section 2: 

1. $o(X) \leftarrow p(X,Y) \wedge \neg p(Y,X)$

2. $p(X,Y) \leftarrow e(X,Y)$

3. $p(X,Y) \leftarrow e(X,Z) \wedge p(Z,Y)$

The corresponding propagation rules would look as follows: 

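1. $\Delta^{+}o(X) \leftarrow \Delta^{+}p(X,Y) \wedge new\,\neg p(Y,X) \wedge \neg o(X)$
   $\Delta^{+}o(X) \leftarrow \Delta^{-}p(Y,X) \wedge new\ p(X,Y) \wedge \neg o(X)$
   $\Delta^{-}o(X) \leftarrow \Delta^{-}p(X,Y) \wedge \neg p(Y,X) \wedge new\,\neg o(X)$
   $\Delta^{-}o(X) \leftarrow \Delta^{+}p(Y,X) \wedge p(X,Y) \wedge new\,\neg o(X)$

2. $\Delta^{+}p(X,Y) \leftarrow \Delta^{+}e(X,Y) \wedge \neg p(X,Y)$
   $\Delta^{-}p(X,Y) \leftarrow \Delta^{-}e(X,Y) \wedge new\,\neg p(X,Y)$

3. $\Delta^{+}p(X,Y) \leftarrow \Delta^{+}e(X,Z) \wedge new\ p(Z,Y) \wedge \neg p(X,Y)$
   $\Delta^{+}p(X,Y) \leftarrow \Delta^{+}p(Z,Y) \wedge new\ e(X,Z) \wedge \neg p(X,Y)$
   $\Delta^{-}p(X,Y) \leftarrow \Delta^{-}e(X,Z) \wedge p(Z,Y) \wedge new\,\neg p(X,Y)$
   $\Delta^{-}p(X,Y) \leftarrow \Delta^{-}p(Z,Y) \wedge e(X,Z) \wedge new\,\neg p(X,Y)$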

Note that the upper indices $\pi$ of the delta literal $\Delta^{\pi}p(Y,X)$ in the propagation
rules for defining $\Delta^{\pi}o(X)$ are inverted as p is negatively referenced by the
corresponding literal in the original rule. Each propagation rule includes one delta
literal for restricting the evaluation to the changes induced by the respective 
body literal. Thus, for each possible update (i.e., insertion or deletion) and for 
each original rule 2n propagation rules are generated if n is the number of body 
literals. It is possible to substitute not only a single body literal but any sub- 
set of them by a corresponding delta literal. This provides a better focus on 
propagated updates but leads to an exponential number of propagation rules. 
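
The generation scheme of Definition 7, including the inversion of the delta sign for negatively referenced literals, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rules are represented as plain strings, `ins`/`del` stand for the delta signs + and -, and the names `Literal`, `show` and `propagation_rules` are hypothetical helpers, not part of the paper.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    atom: str
    positive: bool = True


def show(lit, new=False):
    """Render a side literal, optionally wrapped by the mapping 'new'."""
    text = lit.atom if lit.positive else f"not {lit.atom}"
    return f"new {text}" if new else text


def propagation_rules(head, body):
    """Generate the 2n propagation rules for one rule with n body literals."""
    rules = []
    for i, lit in enumerate(body):
        side = body[:i] + body[i + 1:]
        # A negative literal becomes true through a deletion, so its sign is inverted.
        gain, loss = ("ins", "del") if lit.positive else ("del", "ins")
        rules.append(f"ins {head} <- " + ", ".join(
            [f"{gain} {lit.atom}"] + [show(s, new=True) for s in side] + [f"not {head}"]))
        rules.append(f"del {head} <- " + ", ".join(
            [f"{loss} {lit.atom}"] + [show(s) for s in side] + [f"new not {head}"]))
    return rules


one_way = propagation_rules(
    "o(X)", [Literal("p(X,Y)"), Literal("p(Y,X)", positive=False)])
for r in one_way:
    print(r)
```

For the rule defining o, the sketch produces four rules (2n with n = 2), which correspond to the four propagation rules listed for rule 1.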

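Proposition 2. Let $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$ be a stratifiable database, $u_{\mathcal{D}}$ an update and
$u_{\mathcal{D} \to \mathcal{D}'}$ the corresponding induced update from $\mathcal{D}$ to $\mathcal{D}'$. Let
$\mathcal{D}^{a} = \langle \mathcal{F} \cup prop\_seeds(u_{\mathcal{D}}),\ \mathcal{R} \cup \tau(\mathcal{R}) \cup \varphi(\mathcal{R}) \rangle$ be the augmented deductive
database of $\mathcal{D}$. Then the delta relations defined by the propagation rules $\varphi(\mathcal{R})$
correctly represent the induced update $u_{\mathcal{D} \to \mathcal{D}'}$. Hence, for each relation $p \in pred(\mathcal{D})$
the following conditions hold:

$\Delta^{+}p(\vec{t}) \in \mathcal{M}_{\mathcal{D}^{a}} \iff p(\vec{t}) \in u^{+}_{\mathcal{D} \to \mathcal{D}'}$

$\Delta^{-}p(\vec{t}) \in \mathcal{M}_{\mathcal{D}^{a}} \iff p(\vec{t}) \in u^{-}_{\mathcal{D} \to \mathcal{D}'}$.

Proof. cf. [6, p. 161-163].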

Transition as well as propagation rules can be determined at schema definition 
time and don’t have to be recompiled each time a new base update is applied. 
For propagating true updates the results from the derivability and effectiveness 
test are essential. However, the propagation rules can be further enhanced by 
dropping the effectiveness test or by either refining or even omitting the deriv- 
ability test in some cases. As an example, consider a derived relation which is 
defined without an implicit union or projection. In this case no multiple deriva- 
tions of facts are possible, and thus the effectiveness test in the corresponding 
propagation rules can be omitted. Additionally, the presented transformation- 
based approach solely specifies true updates, but can be extended to describe 
the induced modifications at an arbitrary granularity (cf. [6,14]) which allows for 
cutting down the cost of propagation as long as no accurate results are required. 
In the sequel, however, we will not consider these specialized propagation rules 
as these optimizations are orthogonal to the following discussion. 

Although the application of delta literals indeed restricts the computation of
induced updates, the side literals and effectiveness tests within the propagation
rules as well as the transition rules of this example require the entire new and old
states of the relations e, p and o to be derived within a bottom-up materialization. The
reason is that the supposed evaluation over the two consecutive database states
is performed using deductive rules which are not specialized with respect to the
particular updates that are propagated. This weakness of propagation rules in
view of a bottom-up materialization will be cured by incorporating Magic Sets.



4 Update Propagation via Soft Stratification 



In Section 3 we already pointed to the obvious inefficiency of update propagation, 
if performed by a pure bottom-up materialization of the augmented database. In 
fact, simply applying iterated fixpoint computation [1] to an augmented database 
implies that all old and new state relations will be entirely computed. The only 
benefit of incremental propagation rules is that their evaluation can be avoided 
if delta relations are empty. In a pure top-down approach, however, the values of 
the propagated updates can be passed to the side literals and effectiveness tests 
automatically restricting their evaluation to relevant facts. The disadvantage is 
that all existing delta relations must be queried in order to check whether they 
are affected by an update, although for most of them this will not be the case. 

In this section we develop an approach which combines the advantages of the 
two strategies discussed above. In this way, update propagation is automatically 
limited to the affected delta relations and the evaluation of side literals and effec- 
tiveness tests is restricted to the updates currently propagated. We will use the 
Magic Sets approach for incorporating a top-down evaluation strategy by consid- 
ering the currently propagated updates in the dynamic body literals as abstract 
queries on the remainder of the respective propagation rule bodies. Evaluating 
these propagation queries has the advantage that the respective state relations 
will only be partially materialized. Moreover, later evaluations of propagation 
queries can re-use all state facts derived in previous iteration rounds. 



4.1 Soft Update Propagation by Example 

Before formally presenting the soft update propagation approach, we will
illustrate the main ideas by means of an example. Let us consider the following
stratifiable deductive database $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$ with

$\mathcal{R}$: $p(X,Y) \leftarrow e(X,Y)$
    $p(X,Y) \leftarrow e(X,Z) \wedge p(Z,Y)$

$\mathcal{F}$: e(1,2), e(1,4), e(3,4), e(10,11), e(11,12), ..., e(99,100)

The positive portion $\mathcal{M}^{+}_{\mathcal{D}}$ of the corresponding total well-founded model
$\mathcal{M}_{\mathcal{D}} = \mathcal{M}^{+}_{\mathcal{D}} \cup \neg\cdot\mathcal{M}^{-}_{\mathcal{D}}$ consists of 4098 p-facts, i.e. $|\mathcal{M}^{+}_{\mathcal{D}}| = 4098 + |e| = 4191$ facts. For
maintaining readability we restrict our attention to the propagation of insertions.
Let the mapping new for a literal $A = r(\vec{x})$ be defined as $new\ A := r^{new}(\vec{x})$. The
respective propagation rules $\varphi(\mathcal{R})$ are
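$\Delta^{+}p(X,Y) \leftarrow \Delta^{+}e(X,Y) \wedge \neg p(X,Y)$

$\Delta^{+}p(X,Y) \leftarrow \Delta^{+}e(X,Z) \wedge p^{new}(Z,Y) \wedge \neg p(X,Y)$

$\Delta^{+}p(X,Y) \leftarrow \Delta^{+}p(Z,Y) \wedge e^{new}(X,Z) \wedge \neg p(X,Y)$

while the naive transition rules $\tau(\mathcal{R})$ are

$p^{new}(X,Y) \leftarrow e^{new}(X,Y)$
$p^{new}(X,Y) \leftarrow e^{new}(X,Z) \wedge p^{new}(Z,Y)$
$e^{new}(X,Y) \leftarrow e(X,Y) \wedge \neg\Delta^{-}e(X,Y)$
$e^{new}(X,Y) \leftarrow \Delta^{+}e(X,Y)$.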



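Let $u_{\mathcal{D}}$ be an update consisting of the new edge fact e(2,3) to be inserted into
$\mathcal{D}$, i.e. $u^{+}_{\mathcal{D}} = \{e(2,3)\}$. The resulting augmented database is then given
by $\mathcal{D}^{a} = \langle \mathcal{F} \cup \{\Delta^{+}e(2,3)\}, \mathcal{R}^{a} \rangle$ with $\mathcal{R}^{a} = \mathcal{R} \cup \varphi(\mathcal{R}) \cup \tau(\mathcal{R})$. Evaluating the
stratifiable database $\mathcal{D}^{a}$ leads to the generation of 8296 facts for computing the
three induced insertions $\Delta^{+}p(1,3)$, $\Delta^{+}p(2,3)$, and $\Delta^{+}p(2,4)$ with respect to p.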
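
The figure of 8296 can be reproduced with a short Python sketch. It assumes one particular reading of "generated facts", namely the old state of p, the new states of p and e, and the induced insertions into p; this breakdown is an assumption made here for illustration and is not spelled out in the text. The helper `reachable_pairs` is likewise hypothetical and computes the path relation by graph search rather than by rule evaluation.

```python
def reachable_pairs(edges):
    """Return all pairs (x, y) such that y is reachable from x via at least one edge."""
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)
    pairs = set()
    for start in succ:
        seen, stack = set(), list(succ[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(succ.get(node, ()))
        pairs |= {(start, node) for node in seen}
    return pairs


e_old = {(1, 2), (1, 4), (3, 4)} | {(i, i + 1) for i in range(10, 100)}
e_new = e_old | {(2, 3)}          # base update: insert e(2,3)
p_old = reachable_pairs(e_old)    # old state of p
p_new = reachable_pairs(e_new)    # new state of p
delta_plus_p = p_new - p_old      # induced insertions into p

# Assumed breakdown: old p-facts + new p-facts + new e-facts + induced insertions.
generated = len(p_old) + len(p_new) + len(e_new) + len(delta_plus_p)
print(len(p_old), len(p_new), len(e_new), sorted(delta_plus_p), generated)
```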

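We will now apply our Magic Updates rewriting to the rules above with
respect to the propagation queries represented by the set $Q_u = \{\Delta^{+}e(X,Y),
\Delta^{+}e(X,Z), \Delta^{+}p(Z,Y)\}$ of delta literals in the propagation rule bodies. Let $\mathcal{R}^{a}_{Q_u}$
be the adorned rule set of $\mathcal{R}^{a}$ with respect to the propagation queries $Q_u$. The
rule set resulting from the Magic Updates rewriting will be denoted $mu(\mathcal{R}^{a}_{Q_u})$ and
consists of the following answer rules for our example

$\Delta^{+}p(X,Y) \leftarrow \Delta^{+}e(X,Y) \wedge \neg p_{bb}(X,Y)$
$\Delta^{+}p(X,Y) \leftarrow \Delta^{+}e(X,Z) \wedge p^{new}_{bf}(Z,Y) \wedge \neg p_{bb}(X,Y)$
$\Delta^{+}p(X,Y) \leftarrow \Delta^{+}p(Z,Y) \wedge e^{new}_{fb}(X,Z) \wedge \neg p_{bb}(X,Y)$

$p^{new}_{bf}(X,Y) \leftarrow m\_p^{new}_{bf}(X) \wedge e^{new}_{bf}(X,Z) \wedge p^{new}_{bf}(Z,Y)$
$p^{new}_{bf}(X,Y) \leftarrow m\_p^{new}_{bf}(X) \wedge e^{new}_{bf}(X,Y)$
$p_{bb}(X,Y) \leftarrow m\_p_{bb}(X,Y) \wedge e(X,Z) \wedge p_{bb}(Z,Y)$
$p_{bb}(X,Y) \leftarrow m\_p_{bb}(X,Y) \wedge e(X,Y)$
$e^{new}_{fb}(X,Y) \leftarrow m\_e^{new}_{fb}(Y) \wedge e(X,Y) \wedge \neg\Delta^{-}e(X,Y)$
$e^{new}_{fb}(X,Y) \leftarrow m\_e^{new}_{fb}(Y) \wedge \Delta^{+}e(X,Y)$
$e^{new}_{bf}(X,Y) \leftarrow m\_e^{new}_{bf}(X) \wedge e(X,Y) \wedge \neg\Delta^{-}e(X,Y)$
$e^{new}_{bf}(X,Y) \leftarrow m\_e^{new}_{bf}(X) \wedge \Delta^{+}e(X,Y)$

as well as the following subquery rules

$m\_p^{new}_{bf}(Z) \leftarrow m\_p^{new}_{bf}(X) \wedge e^{new}_{bf}(X,Z)$
$m\_p^{new}_{bf}(Z) \leftarrow \Delta^{+}e(X,Z)$
$m\_e^{new}_{fb}(Z) \leftarrow \Delta^{+}p(Z,Y)$
$m\_e^{new}_{bf}(X) \leftarrow m\_p^{new}_{bf}(X)$
$m\_p_{bb}(X,Y) \leftarrow \Delta^{+}p(Z,Y) \wedge e^{new}_{fb}(X,Z)$
$m\_p_{bb}(X,Y) \leftarrow \Delta^{+}e(X,Y)$
$m\_p_{bb}(X,Y) \leftarrow \Delta^{+}e(X,Z) \wedge p^{new}_{bf}(Z,Y)$
$m\_p_{bb}(Z,Y) \leftarrow m\_p_{bb}(X,Y) \wedge e(X,Z)$.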



Quite similar to the Magic Sets approach, the Magic Updates rewriting may
result in an unstratifiable rule set. This is also the case for our example where the
following negative cycle occurs in the respective dependency graph of $mu(\mathcal{R}^{a}_{Q_u})$:

$\Delta^{+}p \xrightarrow{pos} m\_p_{bb} \xrightarrow{pos} p_{bb} \xrightarrow{neg} \Delta^{+}p$


We will show, however, that the resulting rules must be at least softly
stratifiable such that the soft consequence operator could be used for efficiently
computing their well-founded model. Computing the induced update by evaluating
$\mathcal{D}^{ma} = \langle \mathcal{F} \cup \{\Delta^{+}e(2,3)\}, mu(\mathcal{R}^{a}_{Q_u}) \rangle$ leads to the generation of two new state
facts for e, one old state fact and one new state fact for p. The entire number
of generated facts is 19 in contrast to 8296 for computing the three induced
insertions with respect to p. The reason for the small number of facts is that
only relevant state facts are derived which excludes all p facts derivable from
{e(10,11), e(11,12), ..., e(99,100)} as they are not affected by $\Delta^{+}e(2,3)$.

Although this example already shows the advantages of applying Magic Sets 
to the transformed rules from Section 3, the application of Magic Updates rules 
does not necessarily improve the performance of the update propagation pro- 
cess. This is due to cases where the relevant part of a database represented by 
Magic Sets transformed rules together with the necessary subqueries exceeds 
the amount of derivable facts using the original rule set. For such cases further 
rule optimizations have been proposed (e.g. [16]) which can be also applied to a 
Magic Updates transformed rule set, leading to a well-optimized evaluation. 

4.2 The Soft Update Propagation Approach 

In this section we formally introduce the soft update propagation approach. To 
this end, we define the Magic Updates rewriting and prove its correctness. 

Definition 8 (Magic Predicates). Let $A = p_{ad}(\vec{x})$ be a positive literal with
adornment ad and $bd(\vec{x})$ the sequence of variables within $\vec{x}$ indicated as bound
in the adornment ad. Then the magic predicate of A is defined as
$magic(A) := m\_p_{ad}(bd(\vec{x}))$. If $A = \neg p_{ad}(\vec{x})$ is a negative literal, then the magic predicate of
A is defined as $magic(A) := m\_p_{ad}(bd(\vec{x}))$.

Given a rule set $\mathcal{R}$ and an adorned query $Q = p_{ad}(\vec{c})$ with $p \in pred(\mathcal{R})$, the
adorned rule set of $\mathcal{R}$ with respect to Q shall be denoted $\mathcal{R}_Q$. Additionally, let
$ms(\mathcal{R}_Q)$ be the set of Magic Sets transformed rules with respect to $\mathcal{R}_Q$.
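
As a small illustration of Definition 8, the following Python sketch builds the textual form of a magic literal by keeping only the arguments marked as bound in the adornment. The function name `magic`, the string encoding `m_<predicate>_<adornment>` and the error handling are assumptions made for this example.

```python
def magic(predicate, adornment, args):
    """Return the magic literal for an adorned literal, keeping bound arguments only."""
    if len(adornment) != len(args) or set(adornment) - {"b", "f"}:
        raise ValueError("adornment must be a b/f string matching the arity")
    bound = [a for a, flag in zip(args, adornment) if flag == "b"]
    return f"m_{predicate}_{adornment}({', '.join(bound)})"


print(magic("p", "bb", ["X", "Y"]))
print(magic("p_new", "bf", ["Z", "Y"]))
print(magic("e_new", "fb", ["X", "Z"]))
```

Since negative and positive literals share the same magic predicate in Definition 8, the sketch does not need a polarity argument.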

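Definition 9 (Magic Updates Rewriting). Let $\mathcal{R}$ be a stratifiable rule set,
$\mathcal{R}^{a} = \mathcal{R} \cup \varphi(\mathcal{R}) \cup \tau(\mathcal{R})$ an augmented rule set of $\mathcal{R}$, and $Q_u$ the set of abstract
propagation queries given by all delta literals occurring in rule bodies of
propagation rules in $\varphi(\mathcal{R})$. The Magic Updates rewriting of $\mathcal{R}^{a}$ yields the magic rule
set $mu(\mathcal{R}^{a}_{Q_u}) := \mathcal{R}^{a}_{P} \cup \mathcal{R}^{a}_{S} \cup \mathcal{R}^{a}_{M}$ where $\mathcal{R}^{a}_{P}$, $\mathcal{R}^{a}_{S}$ and $\mathcal{R}^{a}_{M}$ are defined as follows:

1. From $\varphi(\mathcal{R})$ we derive the two deductive rule sets $\mathcal{R}^{a}_{P}$ and $\mathcal{R}^{a}_{S}$: For each
propagation rule $A^{\pi} \leftarrow \Delta^{\pi'}e \wedge L_1 \wedge \ldots \wedge L_n \in \varphi(\mathcal{R})$ where $\Delta^{\pi'}e \in Q_u$ is a
dynamic literal and $\pi, \pi' \in \{+, -\}$, an adorned answer rule of the form

$A^{\pi} \leftarrow \Delta^{\pi'}e \wedge L_1^{ad_1} \wedge \ldots \wedge L_n^{ad_n}$

is in $\mathcal{R}^{a}_{P}$ where each non-dynamic body literal $L_i$ ($1 \le i \le n$) is replaced
by the corresponding adorned literal $L_i^{ad_i}$ while assuming the body literals
$\Delta^{\pi'}e \wedge L_1^{ad_1} \wedge \ldots \wedge L_{i-1}^{ad_{i-1}}$ have been evaluated in advance. Note that the adornment
of each non-derived literal consists of the empty string. For each derived
adorned body literal $L_i^{ad_i}$ ($1 \le i \le n$) a subquery rule of the form

$magic(L_i^{ad_i}) \leftarrow \Delta^{\pi'}e \wedge L_1^{ad_1} \wedge \ldots \wedge L_{i-1}^{ad_{i-1}}$

is in $\mathcal{R}^{a}_{S}$. No other rules are in $\mathcal{R}^{a}_{P}$ and $\mathcal{R}^{a}_{S}$.

2. From the rule set $\mathcal{R}^{state} := \mathcal{R} \cup \tau(\mathcal{R})$ we derive the rule set $\mathcal{R}^{a}_{M}$: For each relation
symbol $magic(L^{ad}) \in pred(\mathcal{R}^{a}_{S})$ the corresponding Magic Set transformed
rule set $ms(\mathcal{R}^{state}_{W})$ is in $\mathcal{R}^{a}_{M}$ where $W = L^{ad}$ represents an adorned query
with $pred(L) \in pred(\mathcal{R}^{state})$ and $\mathcal{R}^{state}_{W}$ is the adorned rule set of $\mathcal{R}^{state}$
with respect to W. No other rules are in $\mathcal{R}^{a}_{M}$.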

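Theorem 1. Let $\mathcal{D} = \langle \mathcal{F}, \mathcal{R} \rangle$ be a stratifiable database, $u_{\mathcal{D}}$ an update,
$u_{\mathcal{D} \to \mathcal{D}'}$ the corresponding induced update from $\mathcal{D}$ to $\mathcal{D}'$, $Q_u$ the set of
all abstract queries in $\varphi(\mathcal{R})$, and $\mathcal{R}^{a} = \mathcal{R} \cup \varphi(\mathcal{R}) \cup \tau(\mathcal{R})$ an augmented rule set
of $\mathcal{R}$. Let $mu(\mathcal{R}^{a}_{Q_u})$ be the result of applying Magic Updates rewriting to $\mathcal{R}^{a}$ and
$\mathcal{D}^{ma} = \langle \mathcal{F} \cup prop\_seeds(u_{\mathcal{D}}), mu(\mathcal{R}^{a}_{Q_u}) \rangle$ the corresponding augmented deductive
database of $\mathcal{D}$. Then $\mathcal{D}^{ma}$ is softly stratifiable and all delta relations in $\mathcal{D}^{ma}$
correctly represent the induced update $u_{\mathcal{D} \to \mathcal{D}'}$, i.e. for all atoms $A \in \mathcal{H}_{\mathcal{D}'}$ with
$A = p(\vec{t})$:

$\Delta^{+}p(\vec{t}) \in \mathcal{M}_{\mathcal{D}^{ma}} \iff p(\vec{t}) \in u^{+}_{\mathcal{D} \to \mathcal{D}'}$

$\Delta^{-}p(\vec{t}) \in \mathcal{M}_{\mathcal{D}^{ma}} \iff p(\vec{t}) \in u^{-}_{\mathcal{D} \to \mathcal{D}'}$.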

Proof (Sketch). The correctness of the Magic Updates rewriting with respect to
an augmented rule set $\mathcal{R}^{a}$ is shown by proving it to be equivalent to a specific
Magic Set transformation of $\mathcal{R}^{a}$ which is known to be sound and complete. A
Magic Sets transformation starts with the adornment phase which basically
depicts information flow between literals in a database according to a chosen sip
strategy. In [2] it is shown that the Magic Sets approach is also sound for so-called
partial sip strategies which may pass on only a certain subset of captured
variable bindings or even no bindings at all. Let us assume we have chosen such a sip
strategy which passes no bindings to dynamic literals such that their adornments
are strings solely consisting of 'f' symbols representing unbound attributes.
Additionally, let $\mathcal{R}^{a}_{h} = \mathcal{R}^{a} \cup \{h \leftarrow \Delta^{\pi_1}p_1(\vec{x}_1)\} \cup \ldots \cup \{h \leftarrow \Delta^{\pi_n}p_n(\vec{x}_n)\}$ be an
extended augmented rule set with rules for defining an auxiliary 0-ary relation
h with $h \notin pred(\mathcal{R}^{a})$, $\{\Delta^{\pi_1}p_1, \ldots, \Delta^{\pi_n}p_n\} = pred(\varphi(\mathcal{R}))$ distinct predicates,
and $\vec{x}_i$ ($i = 1, \ldots, n$) vectors of pairwise distinct variables with a length
according to the arity of the corresponding predicates $\Delta^{\pi_i}p_i$. Relation h references all
derived delta relations in $\varphi(\mathcal{R})$ as they are potentially affected by a given base
update. Note that since $\mathcal{R}^{a}$ is assumed to be stratifiable, $\mathcal{R}^{a}_{h}$ must be stratifiable
as well. The Magic Sets rewriting of $\mathcal{R}^{a}_{h}$ with respect to the query H = h using a
partial sip strategy as proposed above yields the rule set $ms(\mathcal{R}^{a}_{h})$ which is
basically equivalent to the rule set $mu(\mathcal{R}^{a}_{Q_u})$ resulting from the Magic Updates
rewriting. The set $ms(\mathcal{R}^{a}_{h})$ differs from $mu(\mathcal{R}^{a}_{Q_u})$ by the answer rules of the form
$h \leftarrow m\_h, \Delta^{\pi_1}p_{1,ff\ldots}(\vec{x}_1)$, ..., $h \leftarrow m\_h, \Delta^{\pi_n}p_{n,ff\ldots}(\vec{x}_n)$ for the additional relation h,
by subquery rules of the form $m\_\Delta^{\pi_1}p_{1,ff\ldots} \leftarrow m\_h$, ..., $m\_\Delta^{\pi_n}p_{n,ff\ldots} \leftarrow m\_h$,
by subquery rules of the form $m\_\Delta^{\pi_i}p_{i,ff\ldots} \leftarrow m\_\Delta^{\pi_j}p_{j,ff\ldots}$ with $i, j \in \{1, \ldots, n\}$,
and by the usage of $m\_\Delta^{\pi_i}p_{i,ff\ldots}$ literals in propagation rule bodies for defining
a corresponding delta relation $\Delta^{\pi_i}p_{i,ff\ldots}$. Obviously, these rules and literals can
be removed from $ms(\mathcal{R}^{a}_{h})$ without changing the semantics of the remaining delta
relations which themselves coincide with the magic updates rules $mu(\mathcal{R}^{a}_{Q_u})$.
Using Propositions 1 and 2, it can be concluded that $\mathcal{R}^{a}_{h}$ is stratifiable and
all delta relations defined in it correctly represent the induced update $u_{\mathcal{D} \to \mathcal{D}'}$.
Thus, the Magic Sets transformed rules $ms(\mathcal{R}^{a}_{h})$ must be sound and complete as
well. As the magic updates rules $mu(\mathcal{R}^{a}_{Q_u})$ can be derived from $ms(\mathcal{R}^{a}_{h})$ in the
way described above, they must correctly represent the induced update $u_{\mathcal{D} \to \mathcal{D}'}$
as well. In addition, since $ms(\mathcal{R}^{a}_{h})$ is softly stratifiable, the magic updates rules
$mu(\mathcal{R}^{a}_{Q_u})$ must be softly stratifiable, too.

From Theorem 1 it follows that the soft stratification approach can be used for
computing the induced changes represented by the augmented database $\mathcal{D}^{ma}$.
For instance, the partition $\mathcal{P} = P_1 \uplus P_2$ of the Magic Updates transformed rule
set $mu(\mathcal{R}^{a}_{Q_u})$ of our running example with $P_1$ consisting of

$p_{bb}(X,Y) \leftarrow m\_p_{bb}(X,Y) \wedge e(X,Z) \wedge p_{bb}(Z,Y)$
$p_{bb}(X,Y) \leftarrow m\_p_{bb}(X,Y) \wedge e(X,Y)$
$m\_p_{bb}(X,Y) \leftarrow \Delta^{+}p(Z,Y) \wedge e^{new}_{fb}(X,Z)$
$m\_p_{bb}(X,Y) \leftarrow \Delta^{+}e(X,Y)$
$m\_p_{bb}(X,Y) \leftarrow \Delta^{+}e(X,Z) \wedge p^{new}_{bf}(Z,Y)$
$m\_p_{bb}(Z,Y) \leftarrow m\_p_{bb}(X,Y) \wedge e(X,Z)$

and with partition $P_2$ consisting of all other rules, i.e. $P_2 := mu(\mathcal{R}^{a}_{Q_u}) \setminus P_1$, satisfies
the condition of soft stratification. Using the soft consequence operator for the
determination of $lfp(T^{s}_{\mathcal{P}}, \mathcal{F} \cup \{\Delta^{+}e(2,3)\})$ then yields their well-founded model.

5 Conclusion 

We have presented a new bottom-up evaluation method for computing the im- 
plicit changes of derived relations resulting from explicitly performed updates of 
the extensional fact base. The proposed transformation-based approach derives 
propagation rules by means of range-restricted Datalog$^{\neg}$ rules which can be
automatically generated from a given database schema. We use the Magic Sets 
method to combine the advantages of top-down and bottom-up propagation ap- 
proaches in order to restrict the computation of true updates to the affected 
part of the database only. The proposed propagation rules are restricted to the 
propagation of insertions and deletions of base facts in stratifiable databases. 
However, several methods have been proposed dealing with further kinds of up- 
dates or additional language concepts. As far as the latter are concerned, update 
propagation in the presence of built-ins and (numerical) constraints has been dis- 
cussed in [22], while views possibly containing duplicates are considered in [5,7]. 
Aggregates and updates have been investigated in [3,9]. As for the various types 
of updates, methods have been introduced for dealing with the modification of in- 
dividual tuples, e.g. [5,19], the insertion and deletion of views (respectively rules) 
and constraints, e.g. [13,18], and even changes of view and constraint definitions, 
e.g. [8]. All these techniques allow for enhancing our proposed framework. 
Abstract. A query rewriting method is proposed for the heterogeneous
information integration infrastructure formed by the subject mediator environment.
The Local as View (LAV) approach, which treats schemas exported by sources as
materialized views over virtual classes of the mediator, is taken as the basis for the
subject mediation infrastructure. In spite of significant progress in query
rewriting with views, it remains unclear how to rewrite queries in a typed,
object-oriented mediator environment. This paper embeds conjunctive views and
queries into an advanced canonical object model of the mediator. The
“selection-projection-join” (SPJ) conjunctive query semantics based on the type
specification calculus is introduced. The paper demonstrates how existing query
rewriting approaches can be extended to be applicable in such a typed environment. The
paper shows that refinement of the mediator class instance types by the source
class instance types is the basic relationship required for establishing query
containment in the object environment.



1 Introduction 

This work has been performed within the framework of the project [12] aiming at
building large heterogeneous digital repositories interconnected and accessible
through the global information infrastructure. In this infrastructure a middleware
layer is formed by subject mediators providing a uniform ontological, structural,
behavioral and query interface to the multiple data sources. In a specific domain
the subject model is to be defined by the experts in the field independently of
relevant information sources. This model may include specifications of data
structures, terminologies (thesauri), concepts (ontologies), methods applicable to
data, and processes (workflows) characteristic of the domain. These definitions
constitute the specification of a subject mediator. After a subject mediator has
been specified, information providers can register their information at the
mediator for integration in the subject domain. Users need to know only the
subject domain definitions, which contain concepts, data structures and methods
approved by the subject domain community. Thus various information sources
belonging to different providers can be registered at a mediator. Subject mediation
is applicable to various subject domains in science, cultural heritage, mass media,
e-commerce, etc.
The Local as View (LAV) approach [8], which treats schemas exported by sources as
materialized views over virtual classes of the mediator, is taken as the basis for the
subject mediation infrastructure. This approach is intended to cope with a dynamic,
possibly incomplete set of sources. Sources may change their exported schemas or
become unavailable from time to time. To disseminate the information sources, their
providers register them (concurrently and at any time) at respective subject mediators.
A method and tool supporting the process of information source registration at the
mediator were presented in [1]. The method is applicable to a wide class of source
specification models representable in the hybrid semi-structured/object canonical
mediator model. Ontological specifications are used for identification of mediator
classes semantically relevant to a source class. A subset of source information
relevant to the mediator classes is discovered based on identification of maximal
commonality between a source and mediated level class specification. Such
commonality is established so that compositions of mediated class instance types
could be refined by a source class instance type.

This paper (for the same infrastructure as in [1]) presents an approach for query 
rewriting in a typed mediator environment. The problem of rewriting queries using 
views has recently received significant attention. The data integration systems de- 
scribed in [2,13] follow an approach in which the contents of the sources are de- 
scribed as views over the mediated schema. Algorithms for answering queries using 
views that were developed specifically for the context of data integration include the 
Bucket algorithm [13], the inverse-rules algorithm [2,3,15], MiniCon algorithm [14], 
the resolution-based approach [7], the algorithm for rewriting unions of general con- 
junctive queries [17] and others. 

Query rewriting algorithms evolved into conceptually simple and quite efficient 
constructs producing the maximally-contained rewriting. Most of them have been 
developed for conjunctive views and queries in the relational, actually typeless data 
models (Datalog). In spite of significant progress of query rewriting with views, it 
remains unclear how to rewrite queries in the typed, object-oriented mediator envi- 
ronment. This paper is an attempt to fill in this gap. The paper embeds conjunctive 
views and queries into an advanced canonical object model of the mediator [9,11]. 
The “selection-projection-join” (SPJ) conjunctive query semantics based on type 
specification calculus [10] is introduced. The paper shows how the existing query 
rewriting approaches can be extended to be applicable in such object framework. To 
be specific, the algorithm for rewriting unions of general conjunctive queries [17] has 
been chosen. The resulting algorithm for the typed environment proposed in the paper 
exploits the heterogeneous source registration facilities [1] that are based on the refin- 
ing mapping of the specific source data models into the canonical model of the media- 
tor, resolving ontological differences between mediated and local concepts as well as 
between structural, behavioral and value conflicts of local and mediated types and 
classes. Due to the space limit, this paper does not consider various aspects of query
rewriting: issues such as the complexity of rewriting and the possibility of computing
all certain answers to a union query are not discussed. These issues build on
well-known results in the area (e.g., it is known that the inverse-rules algorithm
produces the maximally-contained rewriting in time that is polynomial in the size of
the query and the views [8]).

The paper is structured as follows. After a brief analysis of the related work, an
overview of the basic features of the canonical object model of the subject mediator is
given. This overview is focused mostly on the type specification operations of the
model that constitute the basis for object query semantics. In Section 4 an object
query language oriented towards representing unions of conjunctive queries under SPJ
set semantics is introduced. Sections 5 and 6 provide the source registration and query
rewriting approach for such a query language. Section 7 gives an example of query
rewriting in the typed environment. Results are summarized in the conclusion.



2 Related Work 

The state of the art in the area of answering queries using views ranging from theo- 
retical foundations to algorithm design and implementation has been surveyed in [8]. 
Additional evaluations of the query rewriting algorithms can be found in the recent 
papers [7,17] that have not been included into the survey [8]. Inverse rules algorithms 
are recognized due to their conceptual simplicity, modularity and ability to produce 
the maximally-contained rewriting in time that is polynomial in the size of the query 
and the views. The algorithm for rewriting unions of general conjunctive queries
using views [17] compares favorably with existing algorithms: it generalizes the
MiniCon [14] and U-join [15] algorithms and is more efficient than the Bucket
algorithm. An important property of the algorithm [17] is that it finds contained
rewritings of union queries using general conjunctive queries (when the query
and the view constraints both may have built-in predicates).

Studies of the problem of answering queries using views in the context of querying 
object-oriented databases [4,5] exploited some semantic information about the class 
hierarchy as well as syntactic peculiarities of OQL. No concern of object query se- 
mantics in typed environment has been reported. 

In the paper [6] in different context (logic-based query optimization for object da- 
tabases) it has been shown how the object schema can be represented in Datalog. 
Semantic knowledge about the object data model, e.g., class hierarchy information, 
relationship between objects, as well as semantic knowledge about a particular 
schema and application domain are expressed as integrity constraints. An OQL object 
query is represented as a logic query and query optimization is performed in the Data- 
log representation. 

The main contribution of our work is an extension of the query rewriting approach
using views to the typed subject mediation environment. In contrast with
[17], we extend conjunctive queries with object SPJ semantics based on type refine- 
ment relationship and type calculus. The paper shows that refinement of the mediator 
class instance types by the source class instance types is the basic relationship re- 
quired for establishing query containment in the object environment. 



3 Overview of the Basic Features of the Canonical Model 

In the project [12] we chose the SYNTHESIS language [9], a hybrid
semi-structured/object model [11], as the canonical model of a mediator. Here only the
very basic canonical language features are presented to make the examples demonstrating
ideas of the query rewriting readable. It is important to note that the compositional
specification calculus considered [10] does not depend on any specific notation or
modeling facilities. The canonical model [9] supports a wide range of data, from
untyped data on one end of the range to strictly typed data on the other.

Typed data should conform to abstract data types (ADT) prescribing the behaviour of
their instances by means of the type's operations. An ADT describes the interface of a
type whose signature defines the names and types of its operations. An operation is
defined by a predicative specification stating its mixed pre/post conditions. An object
type is a subtype of a non-object ADT with an additional operation self on its
interface providing OIDs. In this paper only the typed capabilities of the SYNTHESIS
language are exploited. Query rewriting with semi-structured (frame) data is planned
to be discussed in future work.

Sets in the language (alongside with bags, sequences) are considered to be a gen- 
eral mechanism of grouping of ADT values. A class is considered as a subtype of a 
set type. Due to that these generally different constructs can be used quite uniformly: 
a class can be used everywhere where a set can be used. For instance, for the query 
language formulae the starting and resulting data are represented as sets of ADT val- 
ues (collections) or of objects (classes). 



3.1 Type Specification Operations 

Semantics of operations over classes in the canonical model are explained in terms of 
the compositional specification calculus [10]. The manipulations of the calculus in- 
clude decomposition of type specifications into consistent fragments, identification of 
common fragments, composition of identified fragments into more complex type 
specifications conforming to the resulting types of the SPJ operations. The calculus 
uses the following concepts and operations. 

A signature $\Sigma_T$ of a type specification $T = \langle V_T, O_T, I_T \rangle$ includes a set of operation
symbols $O_T$ indicating operations' argument and result types and a set of predicate
symbols $I_T$ (for the type invariants) indicating predicate argument types. Conjunction
of all invariants in $I_T$ constitutes the type invariant. We model an extension $V_T$ of each
type T (a carrier of the type) by a set of proxies representing respective instances of
the type.

Definition 1. Type reduct A signature reduct of a type T is defined as a subsigna- 
ture i7j, of type signature Ej that includes a carrier V^, a set of symbols of operations 
(9'j. c Dj, a set of symbols of invariants Ij. 

This definition from the signature level can be easily extended to the specification 
level so that a type reduct can be considered a subspecification (with a signature 
2j.) of specification of the type T. The specification of R^ should be formed so that Rj. 
becomes a supertype of T. We assume that only the states admissible for a type re- 
main to be admissible for a reduct of this type (no other reduct states are admissible). 
Therefore, the carrier of a reduct is assumed to be equal to the carrier of its type. 

Definition 2. Type $U$ is a refinement of type $T$ iff

• there exists a one-to-one correspondence $Ops: O_T \leftrightarrow O_U$;

• there exists an abstraction function $Abs: V_U \rightarrow V_T$ that maps each admissible state of $U$ into the respective state of $T$;

• $\forall x \in V_U \, (I_U(x) \rightarrow I_T(Abs(x)))$;

• for every operation $o \in O_T$ the operation $Ops(o) = o' \in O_U$ is a refinement of $o$. To establish an operation refinement it is required that the operation precondition $pre(o)$ should imply the precondition $pre(o')$ and the operation postcondition $post(o')$ should imply the postcondition $post(o)$.
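The operation refinement condition of Definition 2 can be illustrated with a minimal sketch. It assumes that preconditions and postconditions are ordinary Python predicates and that implication can be checked by enumeration over small finite domains; the names `implies`, `refines`, `abstract_op` and `concrete_op` are hypothetical and are not part of the SYNTHESIS language.

```python
from itertools import product


def implies(p, q, domain):
    """Return True if p(x) implies q(x) for every x in a finite domain."""
    return all(q(x) for x in domain if p(x))


def refines(abstract_op, concrete_op, states, results):
    """Check the two conditions of operation refinement by enumeration.

    Each operation is a dict with a 'pre' predicate over a state and a
    'post' predicate over a (state, result) pair.
    """
    # pre(o) must imply pre(o')
    pre_ok = implies(abstract_op["pre"], concrete_op["pre"], states)
    # post(o') must imply post(o)
    pairs = list(product(states, results))
    post_ok = implies(
        lambda sr: concrete_op["post"](*sr),
        lambda sr: abstract_op["post"](*sr),
        pairs,
    )
    return pre_ok and post_ok


# Hypothetical operations over small integer domains (illustration only).
STATES = range(-3, 4)
RESULTS = range(-3, 6)
abstract_op = {"pre": lambda x: x >= 0, "post": lambda x, r: r >= x}
concrete_op = {"pre": lambda x: True, "post": lambda x, r: r == x + 1}

print(refines(abstract_op, concrete_op, STATES, RESULTS))
print(refines(concrete_op, abstract_op, STATES, RESULTS))
```

The concrete operation accepts more states and promises a stronger result, so it refines the abstract one; the reverse direction fails on the precondition.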

Based on the notions of reduct and type refinement, a measure of common information between types in the type lattice T can be established. Subtyping is defined similarly to the refinement, but Ops becomes an injective mapping.



Definition 3. Type meet operation. An operation $T_1 \cap T_2$ produces a type $T$ as an 'intersection' of the specifications of the operand types. Let $T_1 = \langle V_{T_1}, O_{T_1}, I_{T_1} \rangle$, $T_2 = \langle V_{T_2}, O_{T_2}, I_{T_2} \rangle$, then $T = \langle V_T, O_T, I_T \rangle$ is determined as follows. $O_T$ is produced as an intersection of $O_{T_1}$ and $O_{T_2}$ formed so that if two methods - one from $O_{T_1}$ and another one from $O_{T_2}$ - are in a refinement order, then the most abstract one is included in $O_T$. $I_T = I_{T_1} \vee I_{T_2}$. $T$ is positioned in a type lattice as the most specific supertype of $T_1$ and $T_2$ and a direct subtype of all common direct supertypes of the meet argument types.

If one of the types $T_1$ or $T_2$ or both of them are non-object types then the result of meet is a non-object type^2. Otherwise it is an object type. If $T_2$ ($T_1$) is a subtype of $T_1$ ($T_2$) then $T_1$ ($T_2$) is the result of the meet operation.

Definition 4. Type join operation. An operation $T_1 \cup T_2$ produces a type $T$ as a 'join' of the specifications of the operand types. Let $T_1 = \langle V_{T_1}, O_{T_1}, I_{T_1} \rangle$, $T_2 = \langle V_{T_2}, O_{T_2}, I_{T_2} \rangle$, then $T = \langle V_T, O_T, I_T \rangle$ is determined as follows. $O_T$ is produced as a union of $O_{T_1}$ and $O_{T_2}$ formed so that if two methods - one from $O_{T_1}$ and another one from $O_{T_2}$ - are in a refinement order, then the most refined one is included in $O_T$. $I_T = I_{T_1} \,\&\, I_{T_2}$. $T$ is positioned in a type lattice as the least specific subtype of $T_1$ and $T_2$ and a direct supertype of all the common direct subtypes of the join argument types.

If one of the types $T_1$ or $T_2$ or both of them are object types then the result of join is an object type. Otherwise it is a non-object type. If $T_2$ ($T_1$) is a subtype of $T_1$ ($T_2$) then $T_2$ ($T_1$) is the result of the join operation.
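The meet and join operations can be pictured with a strongly simplified sketch. It assumes that a type specification is reduced to a dictionary from attribute names to attribute type names, and that the subtype relation is a small hand-written table; carriers, invariants and method specifications are deliberately ignored, and all names (`SUPERTYPE`, `meet`, `join`, the sample types) are hypothetical.

```python
# Hypothetical direct-supertype table: child type -> parent type.
SUPERTYPE = {
    "Painting": "Heritage_Entity",
    "Heritage_Entity": "Entity",
    "Creator": "Person",
}


def is_subtype(s, t):
    """True if s equals t or t is reachable from s via SUPERTYPE."""
    while s is not None:
        if s == t:
            return True
        s = SUPERTYPE.get(s)
    return False


def meet(t1, t2):
    """Keep common attributes; of two comparable types keep the more abstract."""
    result = {}
    for name in sorted(t1.keys() & t2.keys()):
        a, b = t1[name], t2[name]
        if is_subtype(a, b):
            result[name] = b
        elif is_subtype(b, a):
            result[name] = a
        # attributes with unrelated types are not common information
    return result


def join(t1, t2):
    """Take all attributes; of two comparable types keep the more refined."""
    result = dict(t1)
    for name, b in t2.items():
        a = result.get(name)
        if a is None or is_subtype(b, a):
            result[name] = b
        elif not is_subtype(a, b):
            raise ValueError(f"attribute {name!r} has unrelated types")
    return result


work = {"title": "string", "author": "Person"}
canvas = {"title": "string", "author": "Creator", "culture": "string"}

print(meet(work, canvas))
print(join(work, canvas))
```

In this toy setting the meet is a common supertype of both operands (fewer attributes, more abstract attribute types), while the join is a common subtype.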

Operations of the compositional calculus form a type lattice [10] on the basis of a subtype relation (as a partial order). In the SYNTHESIS language the type composition operations are used to form type expressions that in a simplified form look as follows:

<type expression> ::= <type term> [<compositional operation> <type expression>] | (<type expression>)

<compositional operation> ::= ∩ | ∪

<type term> ::= <type variable> | <type designator> | <function designator>

<type designator> ::= <type name> | <attribute name> | <reduct>

<reduct> ::= <type name>[<attribute name list>]

<attribute name> ::= <identifier> [/<type name>] [:<attribute path expression>]

<attribute path expression> ::= <identifier>[.<identifier>] ...



^2 If the result of a type meet or join operation is an object type then an instance of the resulting type includes a self value taken from a set of unused OID values.

Reduct T[b/S], where b is an attribute of type T and S is a supertype of the type of the attribute b, denotes a type with the attribute b of type S. Reduct T[a:b.c], where a is an identifier, b is an attribute of type T, the type of the attribute b is S, c is an attribute of S and the type of the attribute c is U, denotes a type with the only attribute a of type U. Type expressions are required to type variables in formulae:

<typed variable>::= <variable>/<type expression> 



3.2 Mediator Schema Example 

For the subject mediator example a Cultural Heritage domain is assumed and a respective mediated schema is provided (Table 1, Table 2). Attribute types that are not specified in the tables are string, text (title, content, description, general_Info) or time (various dates are of this type). The text type is an ADT providing predicates used for specifying basic textual relationships for text retrieval. The time type is an ADT providing temporal predicates. Mentioning the text or time ADT in examples means their reducts that will be refined by the respective types in the sources. These types require much more space to show how to treat them properly. Therefore in the examples we assume texts to be just strings and dates to be of integer type. In the schema, value is a function giving an evaluation of a heritage entity's cost (+/- marks input/output parameters of a function).



Table 1. Classes of the mediated schema

Class             Subclass of       Class inst. type / Type   Subtype of
heritage_entity   -                 Heritage_Entity           Entity
painting          heritage_entity   Painting                  Heritage_Entity
sculpture         heritage_entity   Sculpture                 Heritage_Entity
antiquities       heritage_entity   Antiquities               Heritage_Entity
museum            -                 Repository                -
creator           -                 Creator                   Person



Table 2. Types of the mediated schema

Type              Attributes
Entity            title, date, created_by: Creator,
                  value: {in: function; params: {+e/Entity[title, n/created_by.name], -v/real}}
Heritage_Entity   place_of_origin, date_of_origin, content, in_collection: Collection,
                  digital_form: Digital_Entity
Painting          dimensions
Sculpture         material_medium, exposition_space: {sequence; type_of_element: integer}
Antiquities       type_specimen, archaeology
Repository        name, place, collections: {set-of: Collection}
Collection        name, location, description, in_repository: Repository,
                  contains: {set-of: Heritage_Entity}
Person            name, nationality, date_of_birth, date_of_death, residence
Creator           culture, general_Info, works: {set-of: Heritage_Entity}



4 Subset of the SYNTHESIS Query Language

In the paper a limited subset of the SYNTHESIS query language, oriented towards the representation of unions of conjunctive queries under SPJ set semantics, is considered. To specify query formulae a variant of a typed (multisorted) first order predicate logic language is used. Predicates in formulae correspond to collections (such as sets and bags of non-object instances), classes treated as set subtypes with object-valued instances, and functions. ADTs of instance types of collections and classes are assumed to be defined. A predicate-class (or predicate-collection) is always a unary predicate (a class or collection atom). In query formulae functional atoms^3 corresponding to functions F are syntactically represented as n-ary predicates F(X,Y) where X is a sequence of terms corresponding to input parameters $i_0, i_1, \ldots, i_r$ ($i_0$ is an input parameter typed by an ADT (or its reduct) that includes F as its functional attribute (method); $i_1, \ldots, i_r$ are input parameters having arbitrary types ($r \geq 0$)) and Y is a sequence of typed variables having arbitrary types ($s \geq 1$) and corresponding to output parameters $o_1, \ldots, o_s$. For terms, expressions (in particular cases variables, constants and function designators) can be used. Each term is typed.

Definition 5. SYNTHESIS Conjunctive Query (SCQ, also referred to as a rule) is a query of the form

$q(v/T_v) \mathrel{:-} C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B$

where $q(v/T_v)$, $C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n})$ are collection (class) atoms, $F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m)$ are functional atoms, and B, called the constraint, is a conjunction of predicates over the variables $v, v_1, \ldots, v_n$, typed by $T_v, T_{v_1}, \ldots, T_{v_n}$, or over the output variables $Y_1 \cup Y_2 \cup \ldots \cup Y_m$ of the functional atoms. Each atom $C_i(v_i/T_{v_i})$ or $F_j(X_j,Y_j)$ ($i = 1, \ldots, n$; $j = 1, \ldots, m$) is called a subgoal. The value v structured according to $T_v$ is called the output value of the query. A union query is a finite union of SCQs. Atoms $C_i(v_i/T_{v_i})$ may correspond to intensional collections (classes) that should be defined by rules having the form of an SCQ^4.

The SPJ set semantics of an SCQ body is introduced next^5. The general schema for calculating the resulting collection for an SCQ body $C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B$ is as follows. First, the Cartesian product of the collections in the list is calculated. Then, for each instance obtained, the functional predicates are executed. The method to be executed is determined by the type of the first argument of a functional predicate. The type of the resulting collection of the product is extended with the attributes of the output arguments of each function in the SCQ. Each instance of the resulting collection of the product is extended with the values of the output arguments of each function in the SCQ calculated for this particular instance. After that, those instances of the resulting collection that satisfy B are selected. Finally, joins of product domains are undertaken. We assume that such joins are executed in the order of appearance of the respective collection predicates in the SCQ body, from left to right. The semantics of SCQ conjunctions $C_i(v_i/T_{v_i}), C_j(v_j/T_{v_j})$ is defined by the instance types of the arguments, resulting in a type defined by the join operation of the specifications of the respective argument types. Formal semantics of SCQ are given in the Appendix.

^3 State-based and functional attributes are distinguished in type definitions. Functional attributes are taken (similarly to [6]) out of the instance types involved in the formula to show explicitly and plan the required computations.

^4 To make the presentation more focused, non-recursive SCQs and views are assumed everywhere in the paper.

^5 For the bag semantics it is additionally required to ensure that the multiplicities of answers to the query are not lost in the views (applying set semantics), and are not increased.
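The first stages of the SPJ evaluation schema (Cartesian product, execution of functional predicates, selection by B) can be sketched as follows. This is only an illustration under simplifying assumptions: instances are plain dictionaries, a functional atom is a Python callable that returns its output arguments, and the final joins of product domains are not modelled. The helper `eval_scq_body` and the sample data (`paintings`, `PRICES`) are hypothetical.

```python
from itertools import product


def eval_scq_body(collections, functions, constraint):
    """Evaluate an SCQ body without the final join stage.

    collections: list of (variable name, list of instances)
    functions:   list of callables env -> dict of output arguments
    constraint:  callable env -> bool (the constraint B)
    """
    names = [name for name, _ in collections]
    rows = []
    # 1. Cartesian product of the collections.
    for combo in product(*(instances for _, instances in collections)):
        env = dict(zip(names, combo))
        # 2. Execute functional predicates, appending their outputs.
        for fn in functions:
            env.update(fn(env))
        # 3. Keep only the instances satisfying B.
        if constraint(env):
            rows.append(env)
    return rows


# Hypothetical data for illustration only.
paintings = [
    {"title": "A", "date_of_origin": 1600},
    {"title": "B", "date_of_origin": 1400},
    {"title": "C", "date_of_origin": 1650},
]
PRICES = {"A": 250000.0, "B": 900000.0, "C": 1000.0}


def value(env):
    # Stand-in for the 'value' function of the mediated schema.
    return {"v": PRICES[env["h"]["title"]]}


result = eval_scq_body(
    [("h", paintings)],
    [value],
    lambda env: env["v"] >= 200000 and env["h"]["date_of_origin"] >= 1500,
)
print([row["h"]["title"] for row in result])
```

Only the instance that is both valuable enough and recent enough survives the selection.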

The semantics of disjunctions $C_i(v_i/T_{v_i}) \vee C_j(v_j/T_{v_j})$ requires that for $T_{v_i}$ and $T_{v_j}$ the resulting type of the disjunction is defined by the type operation meet, and the disjunction means a union of $C_i$ and $C_j$. If the result of meet is empty then the disjunction is undefined. Note that in atom $C_i(v_i/T_{v_i})$ any reduct of the $C_i$ instance type may be used for $T_{v_i}$. This leads to a "projection" semantics of $C_i(v_i/T_{v_i})$.

Under such an interpretation, an SCQ (or a rule) is safe if the instance type of the SCQ (rule) head is a supertype of the resulting type of the SCQ (rule) body. Such a resulting type may also include attributes equated (explicitly or implicitly) by B to a variable or to a constant. "Implicitly" may mean that such an attribute is equated to an output argument of a function.

Two SCQs $q_1(v_1/T_{v_1})$ and $q_2(v_2/T_{v_2})$ are said to be comparable if $T_{v_1}$ is a subtype of $T_{v_2}$.

Let $q_1$ and $q_2$ be two comparable queries. $q_1$ is said to be contained in $q_2$, denoted $q_1 \subseteq q_2$, if for any database instance, all of the answers to $q_1$ after their transformation to type $T_{v_2}$ are answers to $q_2$.



5 Sources Registration and Inverse Rules 

During registration a local source class is described as a view over virtual classes of the mediator having the following general form of an SCQ:

$V(h/T_h) \subseteq P_1(b_1/T_{b_1}), \ldots, P_k(b_k/T_{b_k}), F_1(X_1,Y_1), \ldots, F_r(X_r,Y_r), B$

Here V is a source class, $P_1, \ldots, P_k$ are classes of the mediator schema^6, $F_1, \ldots, F_r$ are functions of the mediator schema, and B is a constraint imposed by the view body. This SCQ should be safe. The safety property is established during the registration process. As a result, a one-to-one correspondence between the attributes of a reduct of the resulting instance type of the view body (mediator) and the instance type of the view head (source) is established. On source registration [1] this is done so that a reduct of the view body instance type is refined by the concretizing type designed above the source. Since the open world assumption is applied, each class instance in a view may contain only part of the answers computed for the corresponding query (view body) on the mediator level. To emphasize such incompleteness, the symbol $\subseteq$ is used to connect the head of the view with its body.

For the local sources of our Cultural Heritage domain example, a few schemas are assumed for the Louvre and Uffizi museum Web sites. Several view definitions for these sources registered at the mediator follow. In the views it is assumed that reducts of their instance types refine reducts of the respective mediator class instance types or their compositions. Details can be found in [1]. The same is assumed for attribute types having the same names in a view and in the mediator.



^6 These atoms may also correspond to intensional classes defined by rules in the mediator schema.

Uffizi Site Views

canvas(p/Canvas[title, name, culture, place_of_origin, r_name]) ⊆
painting(p/Painting[title, name: created_by.name, place_of_origin, date_of_origin, r_name:
in_collection.in_repository.name]), creator(c/Creator[name, culture]), r_name = 'Uffizi',
date_of_origin >= 1550, date_of_origin < 1700

artist(a/Artist[name, general_Info, works]) ⊆ creator(a/Creator[name, general_Info,
works/{set-of: Painting}])



Louvre Site Views 

workP(p/Work[title, author, place_of_origin, date_of_origin, in_rep]) ⊆
painting(p/Painting[title, author: created_by.name, place_of_origin, date_of_origin, in_rep:
in_collection.in_repository.name]), in_rep = 'Louvre'

workS(p/Work[title, author, place_of_origin, date_of_origin, in_rep]) ⊆
sculpture(p/Sculpture[title, author: created_by.name, place_of_origin, date_of_origin, in_rep:
in_collection.in_repository.name]), in_rep = 'Louvre'

To produce inverse rules out of the mediator view definitions as above, first replace in the view each attribute from $T_{b_i}$ that is not contained in $T_h$ with a distinct Skolem function of $h/T_h$ producing an output value of the type of the respective attribute. Such a replacement means substitution of the attribute get function in a type with a method defined as a Skolem function that cannot be expressed in terms of a local source. In the text Skolemized attributes are marked with #. All Skolemized attributes are added to the type $T_h$. Such a Skolemizing mapping of the view is denoted as $\rho$. After the Skolemizing mapping, inverse rules for the mediator classes in the view body are produced as $\rho(P_i(b_i/T_{b_i}) \leftarrow V(h/T_h))$ (for $i = 1, \ldots, k$)^7. It is assumed that types $T_{b_i}$ and $T_h$ are defined as $T_{b_i}[a_1/T_1{:}t_1, \ldots, a_n/T_n{:}t_n]$ and $T_h[a_1/S_1, \ldots, a_m/S_m]$ so that $T_{b_i}$ is a supertype of $T_h$ and $T_i$ is a supertype of $S_i$, $i = 1, \ldots, n$.

For the mediator functions being type methods the inverse rules look like p([m/T^J 
^ for j = 1, ...,r, here F,. and are methods of and 

7 such that type of function of refines type of function of Types and 7 
(defined for the mediator and a source respectively) may be given implicitly in the view definition and/or deduced from the registration information. Functional inverse rules are obtained during the source registration at the mediator, not from the view body.

$\rho(B)$ will be called the inferred constraint of the view predicate $V(h/T_h)$. The inverse rules generated from different views must use different Skolem functions (indexing of # will denote this).
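The production of inverse rules with Skolemized attributes can be sketched as below. The sketch assumes that atoms are described only by lists of attribute names (attribute path expressions and types are ignored) and that a Skolem attribute is written by prefixing the attribute name with # and a view index; the function `inverse_rules` and its tuple layout are hypothetical.

```python
def inverse_rules(view_name, view_attrs, body_atoms, view_index):
    """Produce one inverse rule per class atom of a view body.

    Attributes of a body atom that the view head does not provide are
    replaced by Skolem attributes marked with '#<view_index>'.
    Returns tuples (head class, head attributes, view name, view attributes).
    """
    rules = []
    for class_name, attrs in body_atoms:
        skolems = [f"#{view_index}{a}" for a in attrs if a not in view_attrs]
        head = [a if a in view_attrs else f"#{view_index}{a}" for a in attrs]
        rules.append((class_name, head, view_name, list(view_attrs) + skolems))
    return rules


# Attribute names follow the canvas view of the Uffizi site.
canvas_attrs = ["title", "name", "culture", "place_of_origin", "r_name"]
canvas_body = [
    ("painting", ["title", "name", "place_of_origin", "date_of_origin", "r_name"]),
    ("creator", ["name", "culture"]),
]

rules = inverse_rules("canvas", canvas_attrs, canvas_body, 1)
for head_class, head_attrs, view, attrs in rules:
    print(f"{head_class}{head_attrs} <- {view}{attrs}")
```

Here date_of_origin is not exported by canvas, so it appears as the Skolem attribute #1date_of_origin in the rule for painting, while the rule for creator needs no Skolem attribute.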



^7 If $T_{b_i}$ is an object type then the common reduct of $T_{b_i}$ and $T_h$ should be such that the self attribute of $T_{b_i}$ has an interpretation in $T_h$; otherwise self is to be replaced with a Skolem function generating OIDs.

Inverse Rules for canvas View of Uffizi

painting(p/Painting[title, name: created_by.name, place_of_origin, #1date_of_origin, r_name:
in_collection.in_repository.name]) ← canvas(p/Canvas[title, name, culture, place_of_origin,
#1date_of_origin, r_name])

creator(c/Creator[name, culture]) ← canvas(p/Canvas[title, name, culture, place_of_origin,
r_name])

The inferred constraint for canvas(p/Canvas[title, name, culture, place_of_origin,
#1date_of_origin, r_name]):

r_name = 'Uffizi', #1date_of_origin >= 1550, #1date_of_origin < 1700

During Uffizi registration a functional inverse rule is registered as

value(h/Entity[title, name: created_by.name], v/real) ← amount(h/Entity[title, name:
created_by.name], v/real)



Inverse Rules for workP and workS Views of Louvre 

painting(p/Painting[title, author: created_by.name, place_of_origin, date_of_origin, in_rep:
in_collection.in_repository.name]) ← workP(p/Work[title, author, place_of_origin,
date_of_origin, in_rep])

The inferred constraint for workP: in_rep = 'Louvre'

To save space, the similar rule for sculpture is not shown here. During Louvre registration a functional inverse rule is registered as

value(h/Entity[title, name: created_by.name], v/real) ← amount(h/Entity[title, name:
created_by.name], v/real)

Given a union query $Q_u$ defined over the mediator classes, collections and functions, our task is to find a query $Q_v$ defined solely over the view classes, collections and functions such that, for any mediator database instance, all of the answers to $Q_v$ computed using any view instance are correct answers to $Q_u$. $Q_v$ should be a contained rewriting of $Q_u$. There may be many different rewritings of a query.



6 Query Rewriting 

Let $Q_u$ be a union mediator query to be rewritten. Without loss of generality, all the SCQs in $Q_u$ are assumed to be comparable. Similarly to [17], the method for rewriting $Q_u$ consists of two steps. In the first step, we generate a set of candidate formulae (candidates for short) which may or may not be rewritings. These candidates are generated separately for every SCQ in $Q_u$. In the second step, all these candidates are checked to see whether correct rewritings can be obtained. A set I of compact inverse rules is assumed to be obtained for various sources as a result of their registration at the mediator [1].

For each SCQ $q(v/T_v) \mathrel{:-} C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B$ in $Q_u$, denoted as Q, do the following.

For each subgoal $C_i(v_i/T_{v_i})$ or $F_j(X_j,Y_j)$ of Q find an inverse rule $r \in I$ with the head $P_i(b_i/T_{b_i})$ or $F_{i_j}(X_{i_j},Y_{i_j})$ such that $C_i = P_i$ (or $P_i$ is the name of any transitive subclass of $C_i$) and $T_{b_i}$ is a subtype of $T_{v_i}$ (or $F_j = F_{i_j}$ and the $F_{i_j}$ function type is a refinement of the $F_j$ type). Further such discovery is called subgoal unification.
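Subgoal unification for class atoms can be sketched as a lookup over inverse rule heads. The sketch assumes that an inverse rule is reduced to a pair (head class, view name) and that only the class condition is checked, i.e. the head class equals the subgoal class or is one of its transitive subclasses; the subtype condition on instance types is omitted. The table `SUBCLASS_OF` follows Table 1, while the helper names are hypothetical.

```python
# Direct subclass relation of the mediated schema (Table 1).
SUBCLASS_OF = {
    "painting": "heritage_entity",
    "sculpture": "heritage_entity",
    "antiquities": "heritage_entity",
}


def is_same_or_subclass(cls, ancestor):
    """True if cls equals ancestor or is a transitive subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def unify_subgoal(subgoal_class, rules):
    """Return the inverse rules whose head class unifies with the subgoal."""
    return [r for r in rules if is_same_or_subclass(r[0], subgoal_class)]


# Inverse rules as (head class, view name), taken from the example views.
RULES = [
    ("painting", "canvas"),
    ("creator", "canvas"),
    ("painting", "workP"),
    ("sculpture", "workS"),
]

matches = unify_subgoal("heritage_entity", RULES)
print(matches)
```

Each match is a building block of a destination; different combinations of matches give different destinations.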

A destination of Q is a sequence D of atoms $P_1(b_1/T_{b_1}), \ldots, P_n(b_n/T_{b_n}), F_{i_1}(X_{i_1},Y_{i_1}), \ldots, F_{i_m}(X_{i_m},Y_{i_m})$ obtained as a result of the unification of the query subgoals with the heads of inverse rules from I. Several destinations can be produced as various combinations of the SCQ subgoal unifications found. Additionally each destination should conform to the following constraints:

1. There is no j such that a constant in $X_j$ of $F_j(X_j,Y_j)$ ($j = 1, \ldots, m$) of Q corresponds to a different constant in the same argument of the respective functional subgoal $F_{i_j}(X_{i_j},Y_{i_j})$ of the destination.

2. No two occurrences of the same variable or of the same function in $F_j$ ($j = 1, \ldots, m$) of Q correspond to two different constants in $F_{i_j}$ ($j = 1, \ldots, m$) of D, and no two occurrences of the same variable or of the same function in the same head of a rule correspond to two different constants in Q.

Once a destination D of Q is found, we can use it to construct a candidate formula as follows. For each atom $P_i(b_i/T_{b_i})$ or $F_{i_j}(X_{i_j},Y_{i_j})$ in D (supposing it is the head of an inverse rule $P_i(b_i/T_{b_i}) \leftarrow V_i(h_i/T_{h_i})$ or of a functional inverse rule, respectively) do the following (if there are rules that have the same head but different bodies, then choose each of them in turn to generate different candidates).

Establish a mapping $\varphi_i$ of attributes and variables in the atom $P_i(b_i/T_{b_i})$ of D and in the associated atom $V_i(h_i/T_{h_i})$ to the attributes and variables of the respective atom $C_i(v_i/T_{v_i})$ of Q. For each variable z in $T_{h_i}$ of $V_i(h_i/T_{h_i})$ which does not appear in $P_i(b_i/T_{b_i})$ as a free variable, let $\varphi_i$ map z to a distinct new variable not occurring in Q or in any other view atom $\varphi_j(V_j(h_j/T_{h_j}))$ ($j \neq i$). Free variables in the atom $P_i(b_i/T_{b_i})$ of D and its associated atom $V_i(h_i/T_{h_i})$ are mapped to the respective atom of Q as follows. For the atom $P_i(b_i/T_{b_i})$ of D and the associated atom $V_i(h_i/T_{h_i})$ the mappings of $b_i$ and $h_i$ to $v_i$ are added to $\varphi_i$. For all $T_{b_i}$ attributes do the following. If an attribute a does not belong to $T_{v_i}$, then add to $\varphi_i$ the mapping of a to an empty attribute (i.e., remove the attribute). If an attribute a belongs to $T_{v_i}$ but has there the form a/R, where R is a supertype of the type of the attribute a, then add to $\varphi_i$ the mapping of a to a/R, or to a/R:#a if a is a Skolem attribute. If the type $T_{v_i}$ contains an attribute of the form b/R:t.s and the type $T_{b_i}$ contains an attribute of the form a/T:t (where t and s are attribute path expressions and R and T are supertypes of the types of the respective attributes), then add to $\varphi_i$ the mapping of a to b/R:a.s, or to a/R:#a if a is a Skolem attribute. Similarly we build a mapping $\varphi_{n+j}$ of attributes and variables in the atom $F_{i_j}(X_{i_j},Y_{i_j})$ of D and in the associated functional atom of the inverse rule body to the attributes and variables of the respective atom $F_j(X_j,Y_j)$ of Q.

For each destination and the variable mappings defined, construct a formula $\Phi$:

$\varphi_1(P_1(b_1/T_{b_1})), \ldots, \varphi_n(P_n(b_n/T_{b_n})), \varphi_{n+1}(F_{i_1}(X_{i_1},Y_{i_1})), \ldots, \varphi_{n+m}(F_{i_m}(X_{i_m},Y_{i_m}))$   ($\Phi$)

Construct the mapping $\delta$ of a constraint of Q to a constraint in $\Phi$. Let $S_Q = (a_1, a_2, \ldots, a_m)$ and $S_\Phi = (c_1, c_2, \ldots, c_m)$ be the sequences of function arguments in Q and $\Phi$ respectively. The mapping $\delta$ is constructed as follows.

Initially, the associated equality of the constraint is $E_\delta$ = True. For i = 1 to m:

1. If $a_i$ is a constant $\alpha$ or a function that results in $\alpha$, but $c_i$ is a variable y, then let $E_\delta = E_\delta \wedge (y = \alpha)$. $\alpha$ should be of the type of y or of any of its subtypes.

2. If $a_i$ is a variable x, and x appears for the first time in position i, then let $\delta$ map x to $c_i$. If x appears again in a later position $j > i$ of $S_Q$, and $c_i \neq c_j$, then let $E_\delta = E_\delta \wedge (c_i = c_j)$. The types of $a_i$, $c_i$ and of $a_j$, $c_j$ are assumed to be the same or in a subtyping order.

We shall get the SCQ:

$q(v/T_v) \mathrel{:-} \varphi_1(P_1(b_1/T_{b_1})), \ldots, \varphi_n(P_n(b_n/T_{b_n})), \varphi_{n+1}(F_{i_1}(X_{i_1},Y_{i_1})), \ldots, \varphi_{n+m}(F_{i_m}(X_{i_m},Y_{i_m})), \delta(B), E_\delta$   ($\Phi_1$)

Replace the heads of the inverse rules in the above SCQ with the rule bodies to get the candidate formula ($\Phi_2$).

If the constraint $\delta(B) \wedge E_\delta$ and the inferred constraints of the view atoms in the candidate formula are consistent and there are no Skolem functions in the candidate of Q, then the formula is a rewriting (remove duplicate atoms if necessary). If there are Skolem functions in $\delta(B) \wedge E_\delta$, then the candidate formula is not a rewriting because the values of the Skolem functions in $\delta(B) \wedge E_\delta$ cannot be determined. Note that x = y in the constraint for terms x and y typed with ADTs T and S is recursively expanded as $x.a_1 = y.a_1 \,\&\, \ldots \,\&\, x.a_n = y.a_n$ where $a_1, \ldots, a_n$ are the common attributes of T and S.

Containment property of the candidate formulae. The candidate formula ($\Phi_2$) has the following property. If we replace each view atom with the corresponding Skolemized view body and treat the Skolem functions as variables, then we will get a safe SCQ Q' (the expansion of the candidate formula ($\Phi_2$)) which is contained in Q. This is because ($\Phi_1$) is a safe SCQ which is equivalent to (Q), and all subgoals and built-in predicates of ($\Phi_1$) are in the body of Q' (this is a containment mapping). We constructed $\Phi_2$ so that for any collection (class) subgoal pair in $\Phi_2$ and Q the instance type of a subgoal of $\Phi_2$ is a refinement of the instance type of the respective subgoal of Q, and for any functional subgoal pair in $\Phi_2$ and Q the type of function of a subgoal of $\Phi_2$ is a refinement of the type of function of the respective subgoal of Q. In some cases it is possible to obtain rewritings from the candidate formulae by eliminating Skolem functions [17]. If the inferred constraints of the view atoms imply the constraints involving Skolem functions in the candidate formula, then we can remove those constraints directly.

Consistency Checking

The main consistency check during the rewriting consists in testing that the constraint $\delta(B) \wedge E_\delta$ together with the inferred constraints of the view atoms in a candidate formula are consistent. Here we define how it can be done for arithmetic constraints, following the complete algorithm for checking implications of arithmetic predicates [16].

1. Assuming that in an SCQ only arithmetic predicates can be applied, form Arith = $\delta(B) \wedge E_\delta \wedge$ <inferred constraints of the view atoms in the candidate formula $\Phi_2$>.

2. For the candidate formula $\Phi_2$ it is required to show that there exist correct substitutions of type attributes and function arguments^8 in $\Phi_2$ satisfying Arith.
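The construction of the mapping δ and of its associated equality E_δ described in this section can be sketched as follows. The sketch assumes that arguments are given as two equally long Python lists, that variables are strings starting with "?" and everything else is a constant, and that E_δ is collected as a list of pairs to be read as a conjunction of equalities; the function `build_delta` and this encoding are hypothetical.

```python
def is_var(term):
    """Variables are encoded as strings starting with '?' (an assumption)."""
    return isinstance(term, str) and term.startswith("?")


def build_delta(s_q, s_phi):
    """Build the mapping delta and the associated equality E_delta.

    s_q:   sequence of function arguments in the query Q
    s_phi: sequence of the corresponding arguments in the formula
    Returns (delta, equalities); equalities is a list of (left, right) pairs.
    """
    delta, equalities = {}, []
    for a, c in zip(s_q, s_phi):
        if not is_var(a):
            # Step 1: a constant in Q facing a variable yields y = constant.
            if is_var(c):
                equalities.append((c, a))
        elif a not in delta:
            # Step 2, first occurrence of the variable: extend delta.
            delta[a] = c
        elif delta[a] != c:
            # Step 2, repeated occurrence with a different target.
            equalities.append((delta[a], c))
    return delta, equalities


delta, e_delta = build_delta(["?x", 5, "?x", "?z"], ["?u", "?y", "?w", "?t"])
print(delta)
print(e_delta)
```

In the example the repeated query variable ?x forces the equality of ?u and ?w, and the constant 5 is equated to the variable ?y.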



7 Query Rewriting Example in a Subject Mediator 

Rewrite the following query to the Cultural Heritage domain mediator:

valuable_Italian_heritage_entities(h/Heritage_Entity_Valued[title, c_name, r_name, v]) :-
heritage_entity(h/Heritage_Entity[title, c_name: created_by.name, place_of_origin,
date_of_origin, r_name: in_collection.in_repository.name]), value(h/Heritage_Entity[title,
name: c_name], v/real), v >= 200000, date_of_origin >= 1500, date_of_origin < 1750,
place_of_origin = 'Italy'

Destinations Obtained 

For the Uffizi site the heritage_entity subgoal of the query unifies with painting as a heritage_entity subclass. The first destination is obtained as:

painting(p/Painting[title, name: created_by.name, place_of_origin, #1date_of_origin, r_name:
in_collection.in_repository.name]), value(h/Entity[title, name: created_by.name], v/real)



Mapping for the Destination (only different name mappings are shown):

$\varphi_1$ mapping:   p → h, #1date_of_origin → date_of_origin: #1date_of_origin, name → c_name: name

$\varphi_2$ mapping:   name: created_by.name → name: c_name

For the query constraint, $\delta$ is an identity mapping and $E_\delta$ = true. Applying

$\varphi_1$(canvas(p/Canvas[title, name, culture, place_of_origin, r_name])),
$\varphi_2$(amount(h/Entity[title, name: created_by.name], v/real)), v >= 200000, date_of_origin >=
1500, date_of_origin < 1750, place_of_origin = 'Italy'

we get the candidate formula

valuable_Italian_heritage_entities(h/Heritage_Entity_Valued[title, c_name, r_name, v]) :-
canvas(h/Canvas[title, c_name: name, culture, place_of_origin, date_of_origin: #1date_of_origin,
r_name]), amount(h/Entity[title, name: c_name], v/real), v >= 200000,
date_of_origin >= 1500, date_of_origin < 1750, place_of_origin = 'Italy'

For the Louvre the heritage_entity subgoal of the query unifies with painting and sculpture as heritage_entity subclasses. Only the destination formed for painting is shown here. This second destination is obtained as:

painting(p/Painting[title, author: created_by.name, place_of_origin, date_of_origin, in_rep:
in_collection.in_repository.name]), value(h/Entity[title, name: created_by.name], v/real)

^8 It follows that an ability to compute functions during the consistency check, to form admissible combinations of input-output argument values, is required.

Mapping for the Destination:

$\varphi_1$ mapping:   author → c_name: author, in_rep → r_name: in_rep

$\varphi_2$ mapping:   name: created_by.name → name: c_name

Again, $\delta$ is an identity mapping and $E_\delta$ = true. Finally we get the second candidate formula

valuable_Italian_heritage_entities(h/Heritage_Entity_Valued[title, c_name, r_name, v]) :-
workP(h/Work[title, c_name: author, place_of_origin, date_of_origin, r_name: in_rep]),
amount(h/[title, name: c_name], v/real), v >= 200000, date_of_origin >= 1500,
date_of_origin < 1750, place_of_origin = 'Italy'



Obtaining Rewritings from Candidate Formulae 

To retrieve a rewriting we eliminate Skolem functions from the first candidate formula. Note that the inferred constraint for canvas(h/Canvas[title, c_name: name, culture, place_of_origin, date_of_origin: #1date_of_origin, r_name]), which reads r_name = 'Uffizi', #1date_of_origin >= 1550, #1date_of_origin < 1700, implies date_of_origin >= 1500, date_of_origin < 1750 for Uffizi. Because of that, the Skolem functions can be eliminated from this candidate formula, and after the consistency check we get the following rewriting:

valuable_Italian_heritage_entities(h/Heritage_Entity_Valued[title, c_name, r_name, v]) :-
canvas(h/Canvas[title, c_name: name, culture, place_of_origin, r_name]), amount(h/[title,
name: c_name], v/real), v >= 200000, place_of_origin = 'Italy'

The second candidate formula is a correct rewriting without any transformation. 
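The implication used above to eliminate the Skolem function, and the related consistency test, can be sketched for the special case of range constraints on one integer attribute. The sketch assumes half-open integer intervals [lo, hi) as the only kind of constraint, which is much narrower than the arithmetic predicates handled by the algorithm of [16]; the function names and the two extra sample ranges are hypothetical.

```python
def implies_range(inferred, required):
    """True if lo_i <= x < hi_i implies lo_r <= x < hi_r for all integers x."""
    (ilo, ihi), (rlo, rhi) = inferred, required
    return rlo <= ilo and ihi <= rhi


def consistent(a, b):
    """True if the two half-open integer ranges share at least one value."""
    return max(a[0], b[0]) < min(a[1], b[1])


uffizi_inferred = (1550, 1700)  # #1date_of_origin >= 1550, < 1700
query_required = (1500, 1750)   # date_of_origin >= 1500, < 1750

# The inferred constraint implies the query constraint, so the constraint
# on the Skolemized attribute can be dropped from the candidate formula.
print(implies_range(uffizi_inferred, query_required))

# Hypothetical ranges: overlapping but not implied, and disjoint.
print(consistent((1700, 1800), query_required))
print(consistent((1800, 1900), query_required))
```

A view whose inferred range only overlaps the query range remains consistent with it, but its Skolem function cannot be eliminated this way.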



Conclusion 

The paper presents a query rewriting method for the heterogeneous information integration infrastructure formed by a subject mediator environment. The LAV approach, treating schemas exported by sources as materialized views over virtual classes of the mediator, is considered as the basis for the subject mediation infrastructure. The main contribution of this work consists in providing an extension of the query rewriting approach using views for the typed environment of subject mediators. Conjunctive views and queries are considered in the frame of an advanced canonical object model of the mediator. The "selection-projection-join" (SPJ) conjunctive query semantics based on the type specification calculus has been introduced. The paper shows how the existing query rewriting approaches can be extended to be applicable in such a typed framework. The paper demonstrates that refinement of the mediator class instance types by the source class instance types is the basic relationship required for query containment in the typed environment to be established. The approach presented is under implementation for the subject mediator prototype [1]. This implementation also creates a platform for providing various object query languages (e.g., a suitable subset of OQL (ODMG) or SQL:1999) for the mediator interface. Such languages can be implemented by mapping them into the canonical model of the mediator.

In a separate paper it is planned to show how an optimized execution plan for the 
rewritten query is constructed under various limitations of the source capabilities. 
Future plans include also extension of the query rewriting algorithm for frame-based 
semi- structured (XML-oriented) queries as well as investigations for queries (views) 
with negation and recursion. 
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Appendix. Formal Semantics of SYNTHESIS Conjunctive Query 

The semantics of the SCQ $q(v/T_v)$ :- $C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B$,
where $q(v/T_v), C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n})$ are collection (class) atoms, $F_1(X_1,Y_1), \ldots,$
$F_m(X_m,Y_m)$ are functional atoms, and $B$ is a conjunction of predicates over the variables $v,$
$v_1, \ldots, v_n$, is given by a semantic function $s[\cdot]$ constructing the result set of the SCQ body.
$s[\cdot]$ is defined recursively, starting with the semantics of collection atoms. A collection
$C_i$ is considered as a set of values of type $T_{v_i}$. Any value of type $T_{v_i}$ is an element of the
extent of type $T_{v_i}$. Thus the result set $s[C_i(v_i/T_{v_i})]$ of the collection atom $C_i(v_i/T_{v_i})$ is a
subset of this extent.

The first stage of constructing the result set of the SCQ body is as follows. Construct
a Cartesian product of the sets $c_i = s[C_i(v_i/T_{v_i})]$, append elements corresponding to
the values of the output parameters of the functions $F_j(X_j,Y_j)$ to the tuples of the product, and
select all the tuples satisfying the predicate $B$. The semantic function $ccp[\cdot]$ (conditional
Cartesian product) is provided for that:

$ccp[C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B] =$

(Vj,..., vj G C,X...XC„ A 
C ^ '^RI ^ ^ ^ C ^ '^Rm ^ ^ 

y/‘^Ci-y/‘.-. yJ^Q-yJ,-. y/- ^ Q-yJ-} 

I 

cYqh a formula defining values of output parameters of F. in a tuple. To define for- 
mally what c5^is, it is required to make the following assumptions. LetZ, and Y. be 

x= V ., x',..., xr 

Y, = y!.-.y‘ 

Let B. be a type of the structure of the output parameters of the method F . 

{ Rr, in: type; y/ : y/V- } 

Let method F. has input parameters a' /U af/UT, output parameters b‘/W‘,..., 
bf‘/Wf' and predicative specification/ 

Let Q,,..., 2, be all subtypes of the type of the variable - type T^„yi. Let/,...,/,, 
be predicative specifications of the method F, for the types Q,,..., respectively. 

Then formula (taking into consideration a polymorphism of the method F, ) looks 
as follows. 
v„, ^ Vqi E fjthis^v^.,
U, ^ ^Q, ^fr,i{this^v„yi. 



ar-.xr, A... A 

ar^xr, b/^C,y/..... bf‘^C,yf} 



The notation $f\{a \to t\}$, where $f$ is a formula, $a$ is a variable of $f$, and $t$ is a term, means the
formula $f$ with $a$ substituted by $t$.

The second stage of the construction of the result set is a calculation of joins of
product domains. The calculation of a single join is performed by the semantic function
sjoin. It takes a set $t$ of $r$-tuples with types of elements $T_1, \ldots, T_r$ and produces a set of
$(r-1)$-tuples with types of elements $T_1 \cup T_2, T_3, \ldots, T_r$.



sjoin(t)= { 3 (X,...., XjEt, vE ( v=n^i l^i= 

^1 = X3A...A^^= X,) j 



For every tuple from $t$ the function sjoin “glues” the first two elements $\lambda_1 \in V_{T_1}$, $\lambda_2 \in V_{T_2}$
of the tuple into one element $\mu_1 \in V_{T_1 \cup T_2}$. As a value of type $T_1$, $\mu_1$ has all the attributes
of the type $T_1$, and the values of these attributes are the same as the values of the respective
attributes of $\lambda_1$. As a value of type $T_2$, $\mu_1$ has all the attributes of the type $T_2$, and the values of
these attributes are the same as the values of the respective attributes of $\lambda_2$. Equality of values
of attributes is expressed by the following notation.

$v =_T w \equiv v.d_1 =_{Q_1} w.d_1 \wedge \ldots \wedge v.d_g =_{Q_g} w.d_g$

Type $T$ here has attributes $d_1, \ldots, d_g$ of types $Q_1, \ldots, Q_g$ respectively.

To perform all joins for the product

$ccp[C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B]$

it is required to apply the sjoin function $n+m-1$ times.

In case when all types ..., are nonobject types, the type of the result set is 
nonobject and the semantic function $s$ provided for producing the result set of the SCQ
right-hand part $C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B$ is defined as
follows.

$s[C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B] =$

$sjoin(sjoin(\ldots sjoin(ccp[C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B]) \ldots))$

where sjoin is applied $n+m-1$ times.

In case when at least one type of ..., is an object type, the type of the result set 
is object and the semantic function $s$ is defined as follows.

$s[C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B] =$

$objectify(sjoin(sjoin(\ldots sjoin(ccp[C_1(v_1/T_{v_1}), \ldots, C_n(v_n/T_{v_n}), F_1(X_1,Y_1), \ldots, F_m(X_m,Y_m), B]) \ldots)))$



The semantic function objectify here converts a collection of nonobject values into a
collection of objects. This is done by adding an attribute self, which obtains a new unique
identifier, to each value of the nonobject collection.
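
The two stages and the final objectification can be illustrated by a small Python sketch. It is not the mediator's implementation: values are modelled as plain dictionaries from attribute names to values, every function is assumed to receive the chosen collection values and to return a dictionary of its output parameters, and a tuple whose first two elements disagree on a shared attribute is assumed to have no glued element. The names `ccp`, `sjoin`, `objectify` and `s` follow the text; everything else is hypothetical.

```python
from itertools import count, product


def ccp(collections, functions, predicate):
    """Conditional Cartesian product: combine one value per collection,
    append the outputs of every function, keep rows satisfying the predicate."""
    rows = []
    for combo in product(*collections):
        # Assumption: each function maps the chosen values to a dict
        # holding the values of its output parameters.
        row = list(combo) + [f(*combo) for f in functions]
        if predicate(*combo):
            rows.append(row)
    return rows


def sjoin(rows):
    """Glue the first two elements of every row into a single element."""
    joined = []
    for first, second, *rest in rows:
        shared = first.keys() & second.keys()
        # Assumption: rows that disagree on a shared attribute are dropped.
        if all(first[a] == second[a] for a in shared):
            joined.append([{**first, **second}, *rest])
    return joined


def objectify(values):
    """Turn nonobject values into objects by adding a unique 'self' attribute."""
    ids = count(1)
    return [{"self": next(ids), **value} for value in values]


def s(collections, functions, predicate, as_objects=False):
    """Result set of a query body: ccp followed by n+m-1 applications of sjoin."""
    rows = ccp(collections, functions, predicate)
    for _ in range(len(collections) + len(functions) - 1):
        rows = sjoin(rows)
    values = [row[0] for row in rows]
    return objectify(values) if as_objects else values


employees = [{"name": "Ann", "dept": 1}, {"name": "Bob", "dept": 2}]
departments = [{"dept": 1, "title": "Research"}]
result = s([employees, departments], [], lambda e, d: True)
print(result)
```

With two collections and no functions, sjoin is applied once, so only the employee whose `dept` value agrees with the department survives.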
Abstract. As web information systems (WIS) tend to become large, it
becomes decisive that the underlying application story is well designed. 
Such stories can be expressed by a process algebra. In this paper we 
show that such WIS-oriented process algebras lead to many-sorted 
Kleene algebras with tests, where the sorts correspond to scenes in the 
story space. As Kleene algebras with tests subsume propositional Hoare 
logic, they are an ideal candidate for reasoning about the story space. 
We show two applications for this: (1) the personalisation of the story 
space to the preferences of a particular user, and (2) the satisfaction of 
particular information needs of a WIS user. 

Keywords: Web information system, Kleene algebra, process algebra, 
personalisation, navigation 



1 Introduction 

A web information system (WIS) is a database-backed information system that 
is realized and distributed over the web with user access via web browsers. In- 
formation is made available via pages including a navigation structure between 
them and to sites outside the system. Furthermore, there should also be opera- 
tions to retrieve data from the system or to update the underlying database(s). 

Various approaches to develop design methods for WISs have been proposed 
so far. The ARANEUS framework [1] emphasises that conceptual modelling of 
web information systems should approach a problem triplet consisting of con- 
tent, navigation and presentation. This leads to modelling databases, hypertext 
structures and page layout. Other authors refer to the ARANEUS framework. 
The work in [2] addresses the integrated design of hypermedia and operations, 
but remains on a very informal level. Similarly, the work in [4] presents a web 
modelling language WebML and starts to discuss personalisation of web infor- 
mation systems and adaptivity, but again is very informal. 
The OOHDM framework [14] emphasises an object layer, hypermedia components
and an interface layer. This is more or less the same idea as in the work
of the ARANEUS group except that OOHDM explicitly refers to an object ori- 
ented approach. The approach in [5] emphasises a multi-level architecture for the 
data-driven generation of WISs, personalisation, and structures, derivation and 
composition, i.e. it addresses almost the same problem triplet as the ARANEUS 
framework. 

Our own work in [7,16] emphasises a methodology oriented at abstraction lay- 
ers and the co-design of structure, operations and interfaces. Among others this 
comprises a theory of media types, which covers extended views, adaptivity, hierarchies
and presentation style options. This theory is coupled with storyboarding,
an activity that, roughly speaking, addresses the design of an underlying application
story. As soon as WISs become large, it becomes decisive that such an
underlying application story is well designed. 

Application stories can be expressed by some form of process algebra. That 
is, we need atomic activities and constructors for sequencing, parallelism, choice, 
iteration, etc. to write stories. The language SiteLang [6] is in fact such a pro- 
cess algebra for the purpose of storyboarding. In addition to the mentioned 
constructors it emphasises the need for modelling scenes of the story, which can 
be expressed by indexing the atomic activities and using additional constructs 
for entering and leaving scenes. 

In this paper we show that such WIS-oriented process algebras lead to many- 
sorted Kleene algebras with tests, where the sorts correspond to scenes in the 
story space. Kleene algebras (KAs) have been introduced in [10] and extended 
to Kleene algebras with tests (KATs) in [12]. In a nutshell, a KA is an algebra 
of regular expressions, but there are many different interpretations other than 
just regular sets. A KAT imposes an additional structure of a Boolean algebra 
on a subset of the carrier set of a Kleene algebra. 

If we ignore assignments, KATs can be used to model abstract programs. Do- 
ing this, it has been shown in [13] that KATs subsume propositional Hoare logic 
[9]. This subsumption is even strict, as the theory of KATs is complete, whereas 
propositional Hoare logic is not. Therefore, we consider KATs an ideal candidate 
for reasoning about the story space. In this paper we show two applications for 
this: 

— the personalisation of the story space to the preferences of a particular user, and

— the satisfaction of particular information needs of a WIS user.

We will use the on-line loan application example from [3,15] to illustrate 
these applications in the practically relevant area of electronic banking. 

In Section 2 we briefly introduce storyboarding and discuss process algebra 
constructs that are needed to reason about storyboards. In Section 3 we explain 
KATs and our version of many-sorted KATs, as they arise from storyboarding. 
Then, in Section 4 we address the reasoning with KATs and demonstrate the 
two applications. We conclude with a short summary. 
2 Storyboarding in Web Information Systems

As WISs are open in the sense that anyone who has access to the web could 
become a user, the design of such systems requires some anticipation of the 
users’ behaviour. Storyboarding addresses this problem. Thus, a storyboard will 
describe the ways users may choose to navigate through the system. 

2.1 Scenario Modelling 

At a high level of abstraction we may think of a WIS as a set of abstract locations, 
which abstract from actual pages. A user navigates between these locations, and 
on this navigation path s/he executes a number of actions. We regard a location 
together with local actions, i.e. actions that do not change the location, as a unit 
called scene. 

Then a WIS can be described by an edge-labelled directed multi-graph, in
which the vertices represent the scenes, and the edges represent transitions between
scenes. Each such transition may be labelled by an action executed by the
user. If such a label is missing, the transition is due to a simple navigation link.
The whole multi-graph is then called the story space.

Roughly speaking, a story is a path in the story space. It tells what a user of 
a particular type might do with the system. 

The combination of different stories into a subgraph of the story space can
be used to describe a “typical” use of the WIS for a particular task. Therefore,
we call such a subgraph a scenario. Usually storyboarding starts with modelling
scenarios instead of stories, coupled with the integration of stories into the story
space.

At a finer level of details, we may add a triggering event, a precondition and 
a postcondition to each action, i.e. we specify exactly, under which conditions 
an action can be executed and which effects it will have. Further extensions to 
scenes such as adaptivity, presentation, tasks and roles have been discussed in 
[3] and [7], but these extensions are not relevant here. 

Looking at scenarios or the whole story space from a different angle, we may 
concentrate on the flow of actions: 

— For the purpose of storyboarding, actions can be treated as being atomic, i.e. 
we are not yet interested in how an underlying database might be updated. 
Then each action also belongs to a uniquely determined scene. 

— Actions have pre- and postconditions, so we can use annotations to express 
conditions that must hold before or after an action is executed. 

— Actions can be executed sequentially or in parallel, and we must allow
(demonic) choice between actions.

— Actions can be iterated. 

— By adding an action skip we can then also express optionality and iteration 
with at least one execution. 

These possibilities to combine actions lead to operators of an algebra, which 
we will call a story algebra. Thus, we can describe a story space by an element 
of a suitable story algebra. We should, however, note already that story algebras
have to be defined as being many-sorted in order to capture the association of 
actions with scenes. 

2.2 Story Algebras 

Let us now take a closer look at the storyboarding language SiteLang [6], which
in fact defines a story algebra. So, let $S = \{s_1, \ldots, s_n\}$ be a set of scenes, and let
$A = \{\alpha_1, \ldots, \alpha_k\}$ be a set of (atomic) actions. Furthermore, assume a mapping
$\sigma : A \to S$, i.e. with each action $\alpha \in A$ we associate a scene $\sigma(\alpha)$.

This can be used to define inductively the set of processes $\mathcal{P} = \mathcal{P}(A, S)$
determined by $A$ and $S$. Furthermore, we can extend $\sigma$ to a partial mapping
$\mathcal{P} \to S$ as follows:

— Each action $\alpha \in A$ is also a process, i.e. $\alpha \in \mathcal{P}$, and the associated scene
$\sigma(\alpha)$ is already given.

— skip is a process, for which $\sigma$(skip) is undefined.

— If $p_1$ and $p_2$ are processes, then the sequence $p_1 ; p_2$ is also a process. Furthermore,
if $\sigma(p_1) = \sigma(p_2) = s$ or one of the $p_i$ is skip, then $\sigma(p_1 ; p_2)$ is also
defined and equals $s$, otherwise it is undefined.

— If $p_1$ and $p_2$ are processes, then also the parallel process $p_1 \| p_2$ is a process.
Furthermore, if $\sigma(p_1) = \sigma(p_2) = s$ or one of the $p_i$ is skip, then $\sigma(p_1 \| p_2)$ is
also defined and equals $s$, otherwise it is undefined.

— If $p_1$ and $p_2$ are processes, then also the choice $p_1 \,\square\, p_2$ is a process. Furthermore,
if $\sigma(p_1) = \sigma(p_2) = s$ or one of the $p_i$ is skip, then $\sigma(p_1 \,\square\, p_2)$ is also
defined and equals $s$, otherwise it is undefined.

— If $p$ is a process, then also the iteration $p^*$ is a process with $\sigma(p^*) = \sigma(p)$, if
$\sigma(p)$ is defined.

— If $p$ is a process and $\varphi$ is a boolean condition, then the guarded process $\{\varphi\}p$
and the post-guarded process $p\{\varphi\}$ are processes with $\sigma(\{\varphi\}p) = \sigma(p\{\varphi\}) =$
$\sigma(p)$, if $\sigma(p)$ is defined.

Doing this, we have to assume tacitly that navigation between scenes is also
represented by an activity in $A$, and the assigned scene is the origin of the navigation.
SiteLang provides some more constructs, which we have omitted here.
Constructs such as non-empty iteration $p^+$ and optionality $[p]$ can be expressed
by the constructs above, as we have $p^+ = p ; p^*$ and $[p] = p \,\square\,$ skip.
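
The inductive definition can be sketched in Python as a small data type for processes together with a function computing the partial scene assignment. This is an illustration only: the class names, the function `scene` and the example scenes are invented for this sketch, and the rule for skip is read as "the scene of the other operand".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    name: str


@dataclass(frozen=True)
class Skip:
    pass


@dataclass(frozen=True)
class Binary:
    op: str  # ";" (sequence), "||" (parallel) or "[]" (choice)
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    body: object


@dataclass(frozen=True)
class Guarded:
    condition: str
    body: object
    post: bool = False  # True for the post-guarded process p{phi}


def scene(p, sigma) -> Optional[str]:
    """Return the scene of process p, or None where it is undefined."""
    if isinstance(p, Action):
        return sigma[p.name]
    if isinstance(p, Skip):
        return None
    if isinstance(p, Binary):
        # Assumption: if one operand is skip, the scene is that of the other.
        if isinstance(p.left, Skip):
            return scene(p.right, sigma)
        if isinstance(p.right, Skip):
            return scene(p.left, sigma)
        left, right = scene(p.left, sigma), scene(p.right, sigma)
        return left if left is not None and left == right else None
    return scene(p.body, sigma)  # iteration and (post-)guards keep the scene


def plus(p):
    """Non-empty iteration p+ = p ; p*."""
    return Binary(";", p, Star(p))


def optional(p):
    """Optionality [p] = p [] skip."""
    return Binary("[]", p, Skip())


sigma = {"a": "s1", "b": "s1", "c": "s2"}
same = Binary(";", Action("a"), optional(Action("b")))
mixed = Binary("||", Action("a"), Action("c"))
print(scene(same, sigma))   # s1
print(scene(mixed, sigma))  # None
```

The three binary constructors share one rule, which is why a single `Binary` class with an operator tag suffices here.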

Furthermore, we deviated from the SiteLang syntax used in [6] and [3]. For
instance, SiteLang provides constructors and \ to enter or leave a scene, 
respectively. We simply use parentheses and make the associated scene explicit 
in the definition of cr. 

Parallel execution is denoted by [) and choice by p| in SiteLang, whereas 
here we use the more traditional notation. 

SiteLang also uses f\ to mark a parallel execution with a synchronisation 
condition $\varphi$, which in our language here can be expressed by a post-guarded
parallel process $(\ldots \| \ldots)\{\varphi\}$.
Example 2.1. Consider the loan application from [15]. A rough sketch of the
story space can be described as follows: 

enter_loan_system ;
( ( {$\varphi_0$} look_at_loans_at_a_glance □
    ( {$\varphi_1$} request_home_loan_details ;
      ( look_at_home_loan_samples □ skip ) {$\varphi_3$} ) □
    ( {$\varphi_2$} request_mortgage_details ;
      ( look_at_mortgage_samples □ skip ) {$\varphi_4$} ) )* {$\varphi_5$} ) ;
( select_home_loan {$\varphi_6$} □ select_mortgage {$\varphi_7$} ) ;
( ( {$\varphi_6$} ( provide_applicant_details ;
      ( provide_applicant_details □ skip ) ;
      ( describe_loan_purpose || enter_amount_requested ||
        enter_income_details ) ;
      select_hl_terms_and_conditions ) {$\varphi_8$} ) □
  ( {$\varphi_7$} ( provide_applicant_details ; provide_applicant_details* ;
      ( describe_object || enter_mortgage_amount ||
        describe_securities* ) ;
      ( enter_income_details || enter_obligations* ) ;
      ( ( {$\neg\varphi_{12}$} select_m_terms_and_conditions ;
          calculate_payments )* ;
        {$\varphi_{12}$} select_m_terms_and_conditions ) ) {$\varphi_9$} ) ) ;
confirm_application {$\varphi_{10} \vee \varphi_{11}$}

involving the conditions 

$\varphi_0$ = information_about_loan_types_needed
$\varphi_1$ = information_about_home_loans_needed
$\varphi_2$ = information_about_mortgages_needed
$\varphi_3$ = home_loans_known
$\varphi_4$ = mortgages_known
$\varphi_5$ = available_loans_known
$\varphi_6$ = home_loan_selected
$\varphi_7$ = mortgage_selected
$\varphi_8$ = home_loan_application_completed
$\varphi_9$ = mortgage_application_completed
$\varphi_{10}$ = applied_for_home_loan
$\varphi_{11}$ = applied_for_mortgage
$\varphi_{12}$ = payment_options_clear

The set of scenes is $S = \{s_1, \ldots, s_9\}$ with

$s_1$ = type_of_loan         $s_2$ = applicant_details    $s_3$ = home_loan_details
$s_4$ = home_loan_budget     $s_5$ = mortgage_details     $s_6$ = securities_details
$s_7$ = mortgage_budget      $s_8$ = confirmation         $s_9$ = income
The set of actions is $A = \{\alpha_1, \ldots, \alpha_{20}\}$ using

$\alpha_1$ = enter_loan_system                 $\alpha_2$ = look_at_loans_at_a_glance
$\alpha_3$ = request_home_loan_details         $\alpha_4$ = request_mortgage_details
$\alpha_5$ = look_at_home_loan_samples         $\alpha_6$ = look_at_mortgage_samples
$\alpha_7$ = select_home_loan                  $\alpha_8$ = provide_applicant_details
$\alpha_9$ = describe_loan_purpose             $\alpha_{10}$ = enter_amount_requested
$\alpha_{11}$ = enter_income_details           $\alpha_{12}$ = select_hl_terms_and_conditions
$\alpha_{13}$ = select_mortgage                $\alpha_{14}$ = describe_object
$\alpha_{15}$ = enter_mortgage_amount          $\alpha_{16}$ = describe_securities
$\alpha_{17}$ = enter_obligations              $\alpha_{18}$ = select_m_terms_and_conditions
$\alpha_{19}$ = calculate_payments             $\alpha_{20}$ = confirm_application

Finally, we get the scene assignment $\sigma$ with

$\sigma(\alpha_1) = s_1$      $\sigma(\alpha_2) = s_1$      $\sigma(\alpha_3) = s_1$      $\sigma(\alpha_4) = s_1$      $\sigma(\alpha_5) = s_1$
$\sigma(\alpha_6) = s_1$      $\sigma(\alpha_7) = s_1$      $\sigma(\alpha_8) = s_2$      $\sigma(\alpha_9) = s_3$      $\sigma(\alpha_{10}) = s_3$
$\sigma(\alpha_{11}) = s_9$   $\sigma(\alpha_{12}) = s_4$   $\sigma(\alpha_{13}) = s_1$   $\sigma(\alpha_{14}) = s_5$   $\sigma(\alpha_{15}) = s_5$
$\sigma(\alpha_{16}) = s_6$   $\sigma(\alpha_{17}) = s_7$   $\sigma(\alpha_{18}) = s_7$   $\sigma(\alpha_{19}) = s_7$   $\sigma(\alpha_{20}) = s_8$



As a consequence, the scene associated with the sub-process 

enter_loan_system ;
( ( {$\varphi_0$} look_at_loans_at_a_glance □
    ( {$\varphi_1$} request_home_loan_details ;
      ( look_at_home_loan_samples □ skip ) {$\varphi_3$} ) □
    ( {$\varphi_2$} request_mortgage_details ;
      ( look_at_mortgage_samples □ skip ) {$\varphi_4$} ) )* {$\varphi_5$} ) ;
( select_home_loan {$\varphi_6$} □ select_mortgage {$\varphi_7$} )

will also be $s_1$.



3 Many-Sorted Kleene Algebras with Tests 

Let $A$ be an alphabet. Then it is well known that the set of regular expressions
over $A$ is inductively defined as the smallest set $\mathcal{R}$ with $A \subseteq \mathcal{R}$ satisfying the
following conditions:

— the special symbols 1 and 0 are regular expressions in $\mathcal{R}$;

— for $p, q \in \mathcal{R}$ we also have $p + q \in \mathcal{R}$ and $pq \in \mathcal{R}$;

— for $p \in \mathcal{R}$ we also have $p^* \in \mathcal{R}$.

The usual interpretation is by regular subsets of $A^*$, where 1 corresponds to
the regular language $\{\epsilon\}$ containing only the empty word, 0 corresponds to $\emptyset$,
any $a \in A$ corresponds to $\{a\}$, $+$ corresponds to union, concatenation to the
product of regular sets, and $*$ corresponds to the Kleene hull.
3.1 Kleene Algebras

Abstracting from this example of regular expressions we obtain the notion of a 
Kleene algebra as follows. 

Definition 3.1. A Kleene algebra (KA) $\mathcal{K}$ consists of

— a carrier-set $K$ containing at least two different elements 0 and 1, and

— a unary operation $*$ and two binary operations $+$ and $\cdot$ on $K$

such that the following axioms are satisfied:

— $+$ and $\cdot$ are associative, i.e. for all $p, q, r \in K$ we must have $p + (q + r) =$
$(p + q) + r$ and $p(qr) = (pq)r$;

— $+$ is commutative and idempotent with 0 as neutral element, i.e. for all
$p, q \in K$ we must have $p + q = q + p$, $p + p = p$ and $p + 0 = p$;

— 1 is a neutral element for $\cdot$, i.e. for all $p \in K$ we must have $p1 = 1p = p$;

— for all $p \in K$ we have $p0 = 0p = 0$;

— $\cdot$ is distributive over $+$, i.e. for all $p, q, r \in K$ we must have $p(q + r) = pq + pr$
and $(p + q)r = pr + qr$;

— $p^*q$ is the least solution $x$ of $q + px \le x$ and $qp^*$ is the least solution of
$q + xp \le x$, using the partial order $x \le y \equiv x + y = y$.

We adopted the convention to write $pq$ for $p \cdot q$, and to assume that $\cdot$ binds
stronger than $+$, which allows us to dispense with some parentheses. In the
sequel we will write $\mathcal{K} = (K, +, \cdot, *, 0, 1)$ to denote a Kleene algebra.

Of course, the standard example is regular sets. For other non-standard ex- 
amples refer to [10] and [11]. 

Here, we want to use Kleene algebras to represent story algebras as discussed 
in the previous section. Obviously, $+$ will correspond to the choice-operator, $\cdot$
to the sequence-operator, and * to the iteration operator. Furthermore, 1 will 
correspond to skip and 0 to the undefined process fail. However, we will need 
an extension to capture guards and post-guards, we have to think about the 
parallel-operator, and we have to handle associated scenes. Capturing guards and 
post-guards leads to Kleene algebras with tests, which were introduced in [12]. 

Definition 3.2. A Kleene algebra with tests (KAT) $\mathcal{K}$ consists of

— a Kleene algebra $(K, +, \cdot, *, 0, 1)$;

— a subset $B \subseteq K$ containing 0 and 1 and closed under $+$ and $\cdot$;

— and a unary operation $\bar{\ }$ on $B$, such that $(B, +, \cdot, \bar{\ }, 0, 1)$ forms a Boolean algebra.

We write $\mathcal{K} = (K, B, +, \cdot, *, \bar{\ }, 0, 1)$.
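
A concrete model may make the definition more tangible. It is a standard fact that the binary relations over a set form a Kleene algebra under union, relational composition and reflexive-transitive closure, and that the subsets of the identity relation can serve as tests. The Python sketch below is an illustration only; the state set `U` and all function names are chosen for this example.

```python
U = range(3)  # a small, arbitrary set of states

ZERO = frozenset()                      # the empty relation
ONE = frozenset((u, u) for u in U)      # the identity relation


def plus(p, q):
    """Choice: union of relations."""
    return p | q


def times(p, q):
    """Sequence: relational composition."""
    return frozenset((a, d) for (a, b) in p for (c, d) in q if b == c)


def star(p):
    """Iteration: reflexive-transitive closure, computed as a fixpoint."""
    result = ONE
    while True:
        extended = result | times(result, p)
        if extended == result:
            return result
        result = extended


def neg(b):
    """Complement of a test, i.e. of a subset of the identity relation."""
    return ONE - b


def leq(p, q):
    """The partial order x <= y defined by x + y = y."""
    return plus(p, q) == q


p = frozenset({(0, 1), (1, 2)})   # an "action" moving 0 to 1 and 1 to 2
b = frozenset({(0, 0)})           # the test "the state is 0"

print(sorted(star(p)))
print(plus(b, neg(b)) == ONE, times(b, neg(b)) == ZERO)
```

In this model a test followed by an action, `times(b, p)`, restricts the action to the states in which the test holds, which is the intended reading of a guard.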
3.2 Representing Story Algebras by Kleene Algebras with Tests

Now obviously the conditions appearing as guards and post-guards in a story 
algebra, form the set B of tests. So, if we ignore the parallel-constructor || for 
the moment, a story algebra gives rise to a KAT. However, we have to be aware 
that in such a KAT the operators $+$ and $\cdot$ and the constants 0 and 1 play a
double role:

— The operator $+$ applied to two tests $\varphi, \psi \in B$ represents the logical OR,
whereas in general it refers to the choice between two processes. As we have
$(\varphi + \psi)p = \varphi p + \psi p$ this does not cause any problems.

— The operator $\cdot$ applied to two tests $\varphi, \psi \in B$ represents the logical AND,
whereas in general it refers to the sequencing of two processes. As we have
$(\varphi\psi)p = \varphi(\psi p)$ this also does not cause any problems.

— The constant 1 represents both TRUE and skip, whereas 0 represents both
FALSE and fail, which both do not cause problems, as can easily be seen
from the axioms of Kleene algebras.

Furthermore, we may define a scene assignment to a KAT by simply following
the rules for the scene assignment in story algebras. That is, we obtain a partial
mapping $\sigma : K \to S$ with a set $S = \{s_1, \ldots, s_n\}$ of scenes as follows:

— If $p_1, p_2 \in K$ satisfy $\sigma(p_1) = \sigma(p_2) = s$, or one of the $p_i$ is 1 or 0 or a test in
$B$, then $\sigma(p_1 p_2) = s$.

— If $p_1, p_2 \in K$ satisfy $\sigma(p_1) = \sigma(p_2) = s$, or one of the $p_i$ is 1 or 0 or a test in
$B$, then $\sigma(p_1 + p_2) = s$.

— For $p \in K$ with $\sigma(p) = s$ we obtain $\sigma(p^*) = s$.

Finally, let us look at parallel processes. From the intuition of scenes as 
abstract locations, we should assume that atomic actions from different scenes 
can be executed in parallel, which could be rephrased in a way that the order 
does not matter. Obviously, this extends to processes that are associated with 
different scenes. Therefore, the operator || is not needed for processes that belong 
to different scenes - any order will do. More formally, this means the following: 

— If we have $p_1 \| p_2$ with $\sigma(p_1)$ and $\sigma(p_2)$ both defined, but different, then this will be
represented by $p_1 p_2$ (or $p_2 p_1$) in the KAT.

— In the KAT we then need $p_1 p_2 = p_2 p_1$, whenever $\sigma(p_1) \neq \sigma(p_2)$.

This leads to our definition of a many-sorted Kleene algebra with tests. 

Definition 3.3. A many-sorted Kleene algebra with tests (MKAT) is a KAT
$\mathcal{K} = (K, B, +, \cdot, *, \bar{\ }, 0, 1)$ together with a set $S = \{s_1, \ldots, s_n\}$ of scenes and a
scene assignment $\sigma : K \to S$ such that $p_1 p_2 = p_2 p_1$ holds for all $p_1, p_2 \in K$ with
$\sigma(p_1) \neq \sigma(p_2)$.

From our discussion above it is clear that we can represent a story space by
an element of the MKAT that is defined by the atomic actions, the tests and the
scenes.
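Example 3.1. If we rewrite the story space from Example 2.1 we obtain the
following KAT expression:

$\alpha_1((\varphi_0\alpha_2 + \varphi_1\alpha_3(\alpha_5 + 1)\varphi_3 + \varphi_2\alpha_4(\alpha_6 + 1)\varphi_4)^*\varphi_5)(\alpha_7\varphi_6 + \alpha_{13}\varphi_7)$

$(\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8 + \varphi_7\alpha_8\alpha_8^*\alpha_{14}\alpha_{15}\alpha_{16}^*\alpha_{11}\alpha_{17}(\overline{\varphi}_{12}\alpha_{18}\alpha_{19})^*\varphi_{12}\alpha_{18}\varphi_9)$

$\alpha_{20}(\varphi_{10} + \varphi_{11})$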



4 Applying Many-Sorted Kleene Algebras with Tests 



In order to reason about story spaces, we may now exploit the fact that they 
can be described by many-sorted KATs. 



4.1 Equational Reasoning with KATs 

Hoare logic [9] is the oldest formal system for reasoning about abstract programs.
Its basic idea is to use partial correctness assertions, also called Hoare
triplets, of the form $\{\varphi\}p\{\psi\}$. Here $p$ is a program, and $\varphi$ and $\psi$ are its pre-
and postcondition, respectively, i.e. logical formulae that can be evaluated in a
program state.

The informal meaning of these triplets is that “whenever the program $p$ is
started in a state satisfying $\varphi$ and terminates, then it will do so in a state
satisfying $\psi$”.

Using KATs, such a Hoare triplet corresponds to a simple equation $\varphi p \overline{\psi} = 0$.
Equivalently, this can be formulated by $\varphi p \le p\psi$ or $p\overline{\psi} \le \overline{\varphi}p$ or $\varphi p = \varphi p \psi$.
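
For a Hoare triplet $\{\varphi\}p\{\psi\}$, the equivalence between the equation $\varphi p \overline{\psi} = 0$ and the equation $\varphi p = \varphi p \psi$ can be checked in a few steps, using only distributivity and the Boolean laws $\psi + \overline{\psi} = 1$ and $\psi\overline{\psi} = 0$ for tests. The following is a short derivation sketch, not a quotation from [13]:

```latex
% If \varphi p \overline{\psi} = 0, then
\varphi p = \varphi p (\psi + \overline{\psi})
          = \varphi p \psi + \varphi p \overline{\psi}
          = \varphi p \psi .

% Conversely, if \varphi p = \varphi p \psi, then
\varphi p \overline{\psi} = \varphi p \psi \overline{\psi}
                          = \varphi p \, 0
                          = 0 .
```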

In [13] it has been shown that KATs subsume propositional Hoare logic 
(PHL), i.e. all derivation rules of Hoare logic can be proven to be theorems for 
KATs. However, the theory of KATs is complete, whereas PHL is not. 

In order to use KATs to reason about story spaces, the general approach is 
as follows. First we consider the atomic actions and scenes and the many-sorted
KAT defined by them. In this KAT we can express the story space or parts of 
it by some process expression p. We then formulate a problem by using equa- 
tions or conditional equations in this KAT. Furthermore, we obtain (conditional) 
equations, which represent application knowledge. This application knowledge 
arises from events, postconditions and knowledge about the use of the WIS for a 
particular purpose. We then apply all equations to solve the particular problem 
at hand. 

The application knowledge contains at least the following equations: 

1. If an action $p$ has a precondition $\varphi$, then we obtain the equation $\overline{\varphi}p = 0$.

2. If an action $p$ has a postcondition $\psi$, we obtain the equation $p = p\psi$.

3. If an action $p$ is triggered by a condition $\varphi$, we obtain the equation $\varphi = \varphi p$.

4. In addition we obtain exclusion conditions $\varphi\psi = 0$ and tautologies $\varphi + \psi = 1$.
4.2 Personalisation of Story Spaces

The problem of story space personalisation according to the preferences of a
particular WIS user can be formalised as follows. Assume that $p \in K$ represents
the story space. Then we may formulate the preferences of a user by a set $\Sigma$ of
(conditional) equations. Let $\chi$ be the conjunction of the conditions in $\Sigma$. Then
the problem is to find a minimal process $p' \in K$ such that $\chi \Rightarrow px = p'x$ holds
for all $x \in K$.

Preference equations can arise as follows: 

1. An equation $p_1 + p_2 = p_1$ expresses an unconditional preference of activity
(or process) $p_1$ over $p_2$.

2. An equation $\varphi(p_1 + p_2) = \varphi p_1$ expresses a conditional preference of activity
(or process) $p_1$ over $p_2$ in case that the condition $\varphi$ is satisfied.

3. Similarly, an equation $p(p_1 + p_2) = pp_1$ expresses another conditional preference
of activity (or process) $p_1$ over $p_2$ after the activity (or process) $p$.

4. An equation $p_1p_2 + p_2p_1 = p_1p_2$ expresses a preference of order.

5. An equation $p^* = pp^*$ expresses that in case of an iteration it will be executed
at least once.

For instance, assume that the story space is $p = p_1(\varphi(p_2 + p_3) + \overline{\varphi}p_4^*p_5)$ and
that we have the conditional preference rules $\varphi(p_2 + p_3) = \varphi p_2$ and $p_1\overline{\varphi}p_4^* =$
$p_1\overline{\varphi}p_4p_4^*$. Then we get

$px = p_1(\varphi(p_2 + p_3) + \overline{\varphi}p_4^*p_5)x = p_1\varphi p_2x + p_1\overline{\varphi}p_4^*p_5x =$

$p_1\varphi p_2x + p_1\overline{\varphi}p_4p_4^*p_5x = p_1(\varphi p_2 + \overline{\varphi}p_4p_4^*p_5)x.$

That is, we can simplify $p$ by $p' = p_1(\varphi p_2 + \overline{\varphi}p_4p_4^*p_5)$. Obviously, we have
$p' \le p$, but further equations in our application knowledge may give rise to an
even smaller solution. Let us finally illustrate this application with a non-artificial
example.

Example 4.1. Let us continue Example 3.1. Assume that we have to deal with
a user who already knows everything about loans. This can be expressed by
the application knowledge equation $\varphi_5 x = x$ for all $x \in K$. Furthermore, as
knowledge about loans implies that there is no need for information about loans,
we obtain three additional exclusion conditions:

$\varphi_5\varphi_0 = 0 \qquad \varphi_5\varphi_1 = 0 \qquad \varphi_5\varphi_2 = 0$

Taking these equations to the first part of the expression in Example 3.1 we
obtain

$\alpha_1((\varphi_0\alpha_2 + \varphi_1\alpha_3(\alpha_5 + 1)\varphi_3 + \varphi_2\alpha_4(\alpha_6 + 1)\varphi_4)^*\varphi_5)x =$

$\alpha_1((\varphi_0\varphi_5\alpha_2 + \varphi_1\varphi_5\alpha_3(\alpha_5 + 1)\varphi_3 + \varphi_2\varphi_5\alpha_4(\alpha_6 + 1)\varphi_4)^*\varphi_5)x =$

$\alpha_1((0\alpha_2 + 0\alpha_3(\alpha_5 + 1)\varphi_3 + 0\alpha_4(\alpha_6 + 1)\varphi_4)^*\varphi_5)x =$

$\alpha_1\varphi_5x =$

$\alpha_1x$
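That is, the whole story space can be simplified to

$\alpha_1(\alpha_7\varphi_6 + \alpha_{13}\varphi_7)$

$(\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8 + \varphi_7\alpha_8\alpha_8^*\alpha_{14}\alpha_{15}\alpha_{16}^*\alpha_{11}\alpha_{17}(\overline{\varphi}_{12}\alpha_{18}\alpha_{19})^*\varphi_{12}\alpha_{18}\varphi_9)$

$\alpha_{20}(\varphi_{10} + \varphi_{11})$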

This means that for a user who knows about loans the part of the story space 
that deals with information about loans including sample applications will be 
cut out. 
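
The last steps of such a calculation are mechanical and can be sketched in Python. An equation of the form "test times x equals x for all x" forces the test to equal 1 (take x = 1), and an exclusion condition with that test then forces the excluded tests to equal 0. The sketch therefore substitutes 1 and 0 for such tests and normalises with the identities 0p = p0 = 0, 1p = p1 = p, 0 + p = p, p + p = p and 0* = 1* = 1. The term encoding and all names (`simplify`, `alpha`, `phi`, `seq`, `alt`) are assumptions of this sketch, not part of any existing tool.

```python
from functools import reduce

ZERO, ONE = ("0",), ("1",)


def alpha(i):
    return ("act", f"a{i}")


def phi(i):
    return ("test", f"phi{i}")


def seq(*terms):
    """Left-nested product of the given terms."""
    return reduce(lambda left, right: (".", left, right), terms)


def alt(*terms):
    """Left-nested sum of the given terms."""
    return reduce(lambda left, right: ("+", left, right), terms)


def star(term):
    return ("*", term)


def simplify(t, false_tests=frozenset(), true_tests=frozenset()):
    """Replace tests known to be 0 or 1 and apply basic Kleene algebra identities."""
    kind = t[0]
    if kind == "test":
        if t[1] in false_tests:
            return ZERO
        if t[1] in true_tests:
            return ONE
        return t
    if kind in ("act", "0", "1"):
        return t
    if kind == "*":
        body = simplify(t[1], false_tests, true_tests)
        return ONE if body in (ZERO, ONE) else ("*", body)   # 0* = 1* = 1
    left = simplify(t[1], false_tests, true_tests)
    right = simplify(t[2], false_tests, true_tests)
    if kind == ".":
        if ZERO in (left, right):
            return ZERO                                      # 0p = p0 = 0
        if left == ONE:
            return right                                     # 1p = p
        if right == ONE:
            return left                                      # p1 = p
        return (".", left, right)
    if left == ZERO:
        return right                                         # 0 + p = p
    if right == ZERO or left == right:
        return left                                          # p + 0 = p, p + p = p
    return ("+", left, right)


# First part of the expression in Example 3.1.
first_part = seq(
    alpha(1),
    seq(
        star(alt(
            seq(phi(0), alpha(2)),
            seq(phi(1), alpha(3), alt(alpha(5), ONE), phi(3)),
            seq(phi(2), alpha(4), alt(alpha(6), ONE), phi(4)),
        )),
        phi(5),
    ),
)

print(simplify(first_part, {"phi0", "phi1", "phi2"}, {"phi5"}))
```

For the user of Example 4.1 the call reduces the first part of the story space to the single action `a1`; without any knowledge about tests the term is returned unchanged.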



4.3 Satisfaction of Information Needs 

The problem of satisfying the information needs of a particular WIS user can
be formalised by assuming that there is a goal that can be represented by some
formula $\psi$. Thus, we can take $\psi \in B$. Furthermore, assume that our story space
is represented by some process expression $p \in K$. Then the problem is to find a
minimal process $p' \in K$ such that $p\psi = p'\psi$.

In order to find such a p' we have to use the application knowledge. In this 
case, however, we only obtain the general application knowledge that we already 
described above, unless we combine the application with personalisation. 

For instance, assume we can write the story space $p$ as a choice process $p_1 + p_2$.
Let equations $\varphi\psi = 0$ and $p_2 = p_2\varphi$ (postcondition) be part of our application
knowledge. If the goal is $\psi$, we get

$p\psi = (p_1 + p_2)\psi = p_1\psi + p_2\psi = p_1\psi + p_2\varphi\psi = p_1\psi.$

This means we can offer the simplified story space $p_1$ to satisfy the goal $\psi$.
Let us finally illustrate this application with a non-artificial example. 

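Example 4.2. Let us continue Example 3.1 and look at a user who is going to
apply for a home loan. This can be expressed by the goal $\varphi_{10}$. Then we express
application knowledge by the equations $\varphi_{10}\varphi_{11} = 0$ (a user either applies for a
home loan or a mortgage, not for both), $\varphi_{10}\varphi_9 = 0$ (a user applying for a home
loan does not complete a mortgage application) and $\varphi_6\varphi_7 = 0$ (a user either
selects a home loan or a mortgage, but not both).

Then we can simplify $p\varphi_{10}$ with the expression $p$ from Example 3.1 step by
step. First we get $(\varphi_{10} + \varphi_{11})\varphi_{10} = \varphi_{10}$, which can then be used for

$(\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8$

$+ \varphi_7\alpha_8\alpha_8^*\alpha_{14}\alpha_{15}\alpha_{16}^*\alpha_{11}\alpha_{17}(\overline{\varphi}_{12}\alpha_{18}\alpha_{19})^*\varphi_{12}\alpha_{18}\varphi_9)\varphi_{10} =$

$\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8\varphi_{10}$

$+ \varphi_7\alpha_8\alpha_8^*\alpha_{14}\alpha_{15}\alpha_{16}^*\alpha_{11}\alpha_{17}(\overline{\varphi}_{12}\alpha_{18}\alpha_{19})^*\varphi_{12}\alpha_{18}\varphi_9\varphi_{10} =$

$\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8\varphi_{10}$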
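Then finally we get

$(\alpha_7\varphi_6 + \alpha_{13}\varphi_7)\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8\varphi_{10} =$

$\alpha_7\varphi_6\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8\varphi_{10}$

$+ \alpha_{13}\varphi_7\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8\varphi_{10} =$

$\alpha_7\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8\varphi_{10}$

This means that the story space can be simplified to

$\alpha_1((\varphi_0\alpha_2 + \varphi_1\alpha_3(\alpha_5 + 1)\varphi_3 + \varphi_2\alpha_4(\alpha_6 + 1)\varphi_4)^*\varphi_5)$

$\alpha_7\varphi_6\alpha_8(\alpha_8 + 1)\alpha_9\alpha_{10}\alpha_{11}\alpha_{12}\varphi_8\alpha_{20}\varphi_{10}$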

This simply means that for a user who is looking for a home loan application 
the part of the story space that deals with mortgage application will be cut out. 



5 Conclusion 

In this paper we addressed the problem of formal reasoning about web infor- 
mation systems (WISs). We argued that the underlying application story must 
be well designed, especially for large systems. Stories can be expressed by some 
form of process algebra, e.g. using the language SiteLang from [6]. For the most 
relevant reasoning problems it is sufficient to assume that such story algebras are 
propositional, i.e. we ignore assignments and treat atomic operations instead. 

Doing this we demonstrated that Kleene algebras with tests (KATs) are
adequate to describe the stories. We added sorts to KATs in order to enhance
the model by scenes, i.e. abstract bundles of user activities at the same location
in a WIS.

Then we demonstrated the application of many-sorted KATs to the problems of
personalisation and satisfaction of information needs. These cover two highly
relevant aspects of WISs. Thus, the use of KATs demonstrates a huge potential
for improving the quality of WISs.

There are further applications for our approach such as equivalence proofs or 
static analysis of story space specifications, but these still have to be explored. 
However, as static analysis, optimisation, equivalence, etc. have already been 
investigated as application areas of KATs in program analysis, we are confident 
that our approach will be powerful enough to solve these application problems 
as well. 

Furthermore, our research can be extended towards dynamic logic [8], in 
which case we would drop the restriction to ignore assignments. Of course we 
lose decidability properties, but we gain a more complete view of WISs, in which 
the structure and the dynamics of the underlying database is taken into account. 
This, however, implies that we have to deal not only with storyboarding, but
also with the subsequent step of defining database schemata, views and media 
types as outlined in [7,16]. 
Abstract. This paper presents a component framework for inter-organizational
collaboration in value networks in the domain of strategic supply network de- 
velopment. The domain chosen extends the traditional frame of reference in 
strategic sourcing from a supplier-centric to a supply-network-scope. The basic 
functionality provided by the component framework and discussed in this paper 
is the dynamic modeling of strategic supply networks and the collaboration be- 
tween requestors and suppliers in a dynamic network. The corresponding com- 
ponent model is introduced and the functionality provided by the modeling 
component discussed in detail. The problems of heterogeneity that come up in 
inter-organizational communication and collaboration will be addressed by in- 
troducing a collaboration component that guarantees correct interchange and 
representation of application data. It is shown what kind of interoperability 
problems will be encountered in the strategic supply network development sce- 
nario as well as how the communication and collaboration component is able to 
cope with these problems. 



1 Introduction 

With the emergence of the Internet and the continuous innovations in information and 
communication technologies, new possibilities and challenges for improving and 
automating intra- and inter-enterprise business processes arise. Technological
innovations such as global, web-based infrastructures, communication standards and
distributed systems enable the integration of business processes between companies,
thus increasing the flexibility of the business system and improving inter-company
collaboration in value networks, often referred to as inter-organizational systems
(IOS), e-collaboration and collaborative commerce [16]. Companies can therefore
focus more and more on their core competencies and increasingly collaborate with,
and outsource business tasks to, business partners, forming value networks in order
to react better to fast-changing market requirements. An increasing number of
practical initiatives have arisen to support not only intra- but also
inter-organizational collaboration in value networks. The concept of value networks
itself, with companies flexibly collaborating to design, produce, market and
distribute products and services, is not new and had been well established, e.g. by
[18, 32], even before the above-mentioned technology became available. However, at
present IT-enabled value networks can largely be found in the form of rather small,
flexible alliances of professionalized participants. The support of large value
networks with multiple tiers of suppliers, as they can be found in many traditional
production-oriented industries, still causes considerable difficulties.

One of the most analyzed objects of reference in research centered around inter- 
organizational systems, virtual enterprises and value networks is the supply chain; 
especially with respect to the perceived business value. However, failed initiatives, 
primarily in the field of supply chain management, have spurred concern about the 
practicability of present approaches and theories and have shown the need for further 
refinement and adaptation. According to [17], one of the main reasons for this is the 
high degree of complexity that is connected with the identification of potential suppli- 
ers and the modeling of the supply chain structure, as well as the high coordination 
effort between entities in the supply chain. Although both the modeling of
supply chains and the coordination between supply chain entities are basic
prerequisites for success, e.g. in supply chain management, many research efforts
have been based on the more operative interpretation of supply chain management
[14, 15], primarily focusing on the optimization of forecast and planning accuracy
and the optimization of material flows over the whole supply chain.

In order to analyze the basic principles in collaborative networks, such as value
chain modeling and collaboration between value chain entities, the authors see the
necessity of focusing primarily on strategic tasks, such as long-term supplier
development, before dealing with operative tasks of supply chain management, such as
operative purchasing put in a network context. Strategic tasks have not yet been
widely discussed from a network perspective, even though current research work, such
as [9], gives an extended interpretation of supply chain management that partly
considers supplier relationships as part of supply chain management. Therefore the
domain of Strategic Supply Network Development (SSND), which extends the traditional
frame of reference in strategic sourcing from a supplier-centric to a supply-network
scope, is used in this paper to develop a generic framework for inter-enterprise
collaboration providing the basic, but essential, functionalities needed in an
IT-enabled value network.

In the domain of strategic supply network development two goals are pursued.
The first goal is the dynamic modeling of supply networks, helping requestors of
products or services keep track of all suppliers or service providers contributing to
a specific request in a dynamically changing environment. Therefore the concept of self
modeling demand driven value networks is introduced in chapter 2 by means of the 
domain of strategic supply network development. Having explained the concept of 
self modeling demand driven value networks, the generic component model for the 
domain of SSND is introduced in chapter 3, giving a description of the basic compo- 
nents of the framework. 

The second goal pursued by using the domain of strategic supply network
development to develop a component framework for inter-enterprise collaboration is
pointed out in chapter 4. The focus there is set on the description of the
collaboration component, which aims at providing an infrastructure for
collaboration and communication between requestors, suppliers and service providers.
The problems of heterogeneity that come up in inter-organizational communication and
collaboration will be addressed by introducing the collaboration component that
guarantees correct interchange and representation of application data. It is shown
what kind of interoperability problems will be encountered in the strategic supply
network development scenario as well as how the communication and collaboration
component is able to cope with these problems.

A first prototype implementation of the component framework for the domain of 
strategic supply network development is presented in chapter 5 addressing specific 
implementation details regarding the collaboration between network elements. Con- 
clusion and future work are given in chapter 6. 



2 Strategic Supply Network Development and the Concept of Self 
Modeling Demand Driven Value Networks 

Purchasing became a core function in enterprises in the 1990s. Current empirical
research shows a significant correlation between the establishment of a strategic
purchasing function and the financial success of an enterprise, independent of the
industry surveyed [8]. One of the most important factors in this connection is the
buyer-supplier-relationship. At many of the surveyed companies, a close cooperation 
between buyer and supplier in areas such as long-term planning, product development 
and coordination of production processes led to process improvements and resulting 
cost reductions that were shared between buyer and suppliers [8]. 

In practice, supplier development is widely limited to suppliers in tier-1. Given
the importance of supplier development demonstrated above, we postulate the
extension of the traditional frame of reference in strategic sourcing from a
supplier-centric to a supply-network scope, i.e. the further development of
strategic supplier development into strategic supply network development (SSND). This
refocuses the object of reference in the field of strategic sourcing by analyzing sup- 
plier networks instead of single suppliers. Embedded in this paradigm shift is the 
concept of the value network. 



2.1 Strategic Supply Network Development 

The main tasks in the domain of strategic supply network development derive from 
the tasks of strategic sourcing. The most evident changes regard the functions with 
cross-enterprise focus. The process of supplier selection from strategic purchasing 
undergoes the most evident changes in the shift to a supply network perspective. The 
expansion of the traditional frame of reference in strategic sourcing requires more 
information than merely data on existing and potential suppliers in tier-1. Instead, the 
supply networks connected with those suppliers have to be identified and evaluated, 
e.g. by comparing alternative supply networks in the production network. As a
consequence, the task of supplier selection is only one part of the process that
leads to the modeling of strategic supply networks in SSND. In addition to the
modeling, identification and selection of suitable supply networks and the
composition of alternative supply networks, the qualification of strategic supply
networks is another major goal of SSND, analogous to the qualification of suppliers
in strategic sourcing. The main prerequisite is the constant evaluation of the actual
performance of selected supply networks against defined benchmarks. This is important
because of the long-term character of strategic supply network relationships. For a
detailed description of the domain of SSND please refer to [2, 1], where the domain
has been introduced in more detail as an example domain
for the identification and modeling of component based business applications and for 
the standardization of collaborative business applications. 



2.2 Concept of Self Modeling Demand Driven Value Networks 

SSND supports companies in identifying and developing their strategic networks in 
order to improve their productivity and to compete on the daily market. The concept 
of the supply network as a self modeling demand driven network constitutes the basis 
for the identification of strategic supply networks. 

The concept is based on requests for information regarding a specific product
(demands) that are specified by a producer (OEM). The demands can either be fulfilled
by the company’s own production or need to be sent to existing or potential suppliers
in order to receive information about the product’s producibility. Since the OEM
requires information not only about the suppliers in tier-1 in order to develop the
supply network strategically, the demands are split at each node into sub-demands,
which are then forwarded to the next suppliers in the value network. Every node in
tier-x receives demands from clients in tier-(x-1) and communicates sub-demands,
depending on the demand received, to relevant suppliers in tier-(x+1). Since every
node repeats the same procedure, a requestor receives back aggregated information
from the whole dynamically built network based on a specific demand sent at a
specific time.
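
The forwarding and aggregation procedure described above can be sketched as a simple recursion. The following Python fragment is only an illustration under stated assumptions: the `Node` class, its `sub_demands` table (standing in for the result of a bill of materials explosion), the helper `nodes_in` and all example names are hypothetical and not part of the SSND framework.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """A company in the value network (hypothetical structure)."""

    def __init__(self, name: str,
                 sub_demands: Optional[Dict[str, List[Tuple["Node", str]]]] = None):
        self.name = name
        # Assumed result of a bill of materials explosion: for each demand,
        # the (supplier, sub-demand) pairs to contact in the next tier.
        self.sub_demands = sub_demands or {}

    def request(self, demand: str) -> dict:
        """Answer a demand: forward sub-demands, then aggregate all replies."""
        replies = [supplier.request(sub)
                   for supplier, sub in self.sub_demands.get(demand, [])]
        # The node adds its own information to the aggregated replies.
        return {"node": self.name, "demand": demand, "suppliers": replies}


def nodes_in(reply: dict) -> List[str]:
    """List all nodes that contributed to an aggregated reply."""
    names = [reply["node"]]
    for sub_reply in reply["suppliers"]:
        names.extend(nodes_in(sub_reply))
    return names


# Tiny three-tier example network with invented names and demands.
s31 = Node("supplier 3-1")
s22 = Node("supplier 2-2", {"housing": [(s31, "steel sheet")]})
s12 = Node("supplier 1-2", {"pump": [(s22, "housing")]})
oem = Node("producer", {"pump unit": [(s12, "pump")]})
network = oem.request("pump unit")
```

Because each node only knows its direct suppliers, the network structure emerges from the replies themselves, which is the sense in which the network is self modeling.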

At the core of the concept of self modeling demand driven networks is the notion
that network nodes of a supply network can be identified by applying the pull princi- 
ple. With the pull principle (OEM requesting information from suppliers), a network 
node at the beginning of a (sub-)network can identify potential nodes, i.e. suppliers, in 
a subsequent tier by performing a bill of materials explosion. With this information, 
primary requirements and dependent requirements can be identified and the respective 
information can be communicated - sending a demand - to the respective network 
nodes, i.e. potential suppliers for dependent requirements, in the subsequent tier, as 
these suppliers are generally known by the initiating lot. 

The concept is illustrated in the following by means of an example supply network, 
as shown in Fig. 1. The figure on the left shows a complete demand driven network 
consisting of existing (highlighted nodes) and alternative supply sub-networks.
Existing sub-networks are those with which the producer already collaborates.
Alternative sub-networks are networks that are built by sending a demand for a
specific product to newly chosen suppliers that have no relation to the producer yet.
The whole network is demand driven, since the producer communicates a specific
strategic demand, by performing a bill of materials explosion, to existing and
selected alternative suppliers in tier-1. Subsequently, the suppliers in tier-1
themselves perform a bill of materials explosion, reporting the corresponding
sub-demands to their own respective suppliers.

E.g., for supplier 1-2, these are the suppliers 2-2, 2-3 and 2-4 in tier-2.
Subsequently, these suppliers report the newly defined sub-demands to their related
suppliers in tier-3, which split-lot transfer the requested information, including
e.g. ability of delivery for the requested product, capacity per day, minimum volume
to be ordered and time of delivery. The requestors aggregate the information received
from all suppliers contacted for a specific request with their own information and
send it back to supplier 1-2 in tier-1. Having aggregated the information of all
suppliers, supplier 1-2 adds its own information before split-lot transferring it to
the producer.
Fig. 1. Left: Supplier network. Right: Alternative supply network



With the suppliers’ data locally available, the producer can visualize the selected 
sub-network, in which each participant constitutes a network hub. Based on that data, 
the producer is able to evaluate the performance of that selected sub-network by self 
defined benchmarks. In order to optimize sub-networks, alternative demand driven 
sub-networks can be visualized and modeled by applying the same concept as de- 
scribed above to newly defined suppliers. Fig. 1 on the right highlights an alternative 
virtual supply sub-network fulfilling the requirements for a product of the specific 
demand sent. If this alternative supply sub-network is the best performing one in
the whole network, the existing supply sub-network can be modified by substituting
supplier 1-2 in tier-1 with the new supplier 1-1, while keeping supplier 2-2 in tier-2
and supplier 3-1 in tier-3. Since requestor-supplier relationships may change over
time, new dynamically modeled supply networks, which may differ from the actual ones,
are built whenever new demands are sent out to the suppliers in the subsequent tiers.



3 Component Framework 

To provide a basis for modeling demand driven value networks (described in chapter
2.2), a component framework (see Fig. 2) for the domain of strategic supply network
development has been developed, offering basic functionality for the modeling of, and
for the collaboration in, value networks.

The component framework is based on the business component technology as defined
in [25]. The underlying idea of business components is to combine components from
different vendors into an application that is individual to each customer. The
principle of modular black-box design has been chosen for the framework, allowing
different configurations of the SSND system, obtained by combining different
components according to the needs of the specific node, ranging from a very simple
configuration to a very complex and integrated solution. The framework therefore
provides not only the basic functionality needed on each network node in
Fig. 2. Component framework for the domain of strategic supply network development



order to participate in the value network but also the possibility of adding
new functionality by composing additional components into the framework, e.g. by
adding a component for the evaluation of supply networks.

The core part of the component framework is the component model SSND, shown in the
middle of Fig. 2 in the notation of the Unified Modeling Language [21]. Five
components have been identified and designed for the SSND framework. The supply
network development, performance data administration and offer manager components
provide the business logic for the domain of SSND. The business components have been
derived based on the Business Component Modeling (BCM) process as introduced by [1]
and are developed and specified according to the Memorandum of Standardized
Specification of Business Components [26]. The component supply network development
is the main business component responsible for the self modeling of demand driven
value networks in the domain of strategic supply network development, as introduced
in chapter 2.2. The component provides services for the administration of demands
(specifying demands, accessing stored demands, deleting demands etc.) and for the
administration of strategic supply networks (storing supply networks, updating
strategic supply networks etc.) and thus provides the core business functionality of
the component framework. In order to find potential suppliers for a specific demand,
each company in the network has to provide catalogue information about the products
it offers. The component offer manager therefore provides functionality for the
administration of such catalogues. The catalogue information and the companies’
contact information are made available to all network participants through a
centrally organized directory service (see Fig. 2). The reason for this is to give new
companies the possibility of publishing their catalogue information so that they can
participate in the value network. This provides added value for the network, allowing
companies to find and contact new potential suppliers via the directory service while
sending demands directly to existing suppliers. When sending out demands to suppliers
in order to strategically develop the value network, it is relevant to receive
information not only about products’ producibility but also about companies’
performance. Enterprises participating in the value network therefore have to provide
self information, e.g. information about the legal status of the company, year of
creation, name of CEO, workforce, volume of sales, profit and loss calculations etc.
The business component responsible for the administration of self information is
called performance data administration in the component framework. The three business
components presented provide the basic business functionality needed to strategically
develop supply networks based on demands and companies’ performance data.

In addition to the business components introduced, the component framework
provides two system components, persistence manager and collaboration manager,
responsible for the technical administration of the data and for the collaboration
between network nodes. The information managed by the offer manager, supply network
development and performance data administration components is made persistent through
the persistence manager. The main reason for introducing the persistence manager is
the idea of having business components concentrate on the business logic while system
components take care of implementation-specific details. This has an impact on the
distribution of the SSND system over network nodes, since different companies use
different physical database systems as data storage. The framework handles that
situation by having the persistence manager take care of implementation-specific
details without affecting the business logic of SSND. The framework provides three
semantic storages for SSND data. The supply network database stores all supply
networks containing the aggregated information of suppliers contributing to a
specific demand. For each demand, a new supply network is generated by split-lot
transferring data from all suppliers and aggregating the information in the supply
network development component. Such a network is then stored in the supply network
database through the services provided by the persistence manager and called by the
supply network development component. The information can e.g. be retrieved in order
to visualize and strategically develop the supply networks. The performance database
provides storage for the companies’ self information and the material group database
is responsible for storing the products offered by the company. The material group
database additionally stores information for mapping the material group numbers of
suppliers to the company’s own internal representation of material group numbers. A
mapping can either be based on product classification standards such as eCl@ss [10]
or UN/SPSC [28] or, if a supplier does not make use of a specific standard, on tables
cross-referencing the product numbers of the two companies [13].
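
The two mapping strategies for material group numbers can be illustrated by a small lookup routine. This is a minimal sketch, not the framework's actual implementation: the function `to_internal_group`, the flag `uses_standard`, both tables and all codes shown are invented for illustration and are not real eCl@ss or UN/SPSC codes.

```python
from typing import Dict, Optional, Tuple

# Hypothetical tables as they might be kept in the material group database.
STANDARD_CODES: Dict[str, str] = {"STD-1001": "MG-07"}
CROSS_REFERENCE: Dict[Tuple[str, str], str] = {("supplier 2-3", "A-17"): "MG-07"}


def to_internal_group(supplier: str, code: str, uses_standard: bool) -> Optional[str]:
    """Map a supplier's material group number to the internal representation."""
    if uses_standard:
        # The supplier classifies products by a shared standard, so the
        # mapping is independent of the individual supplier.
        return STANDARD_CODES.get(code)
    # Otherwise fall back to a table cross-referencing the product numbers
    # of exactly these two companies.
    return CROSS_REFERENCE.get((supplier, code))
```

A result of `None` signals that no mapping is known, in which case a demand could not be matched to the supplier's catalogue without further manual work.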

Example clients requesting collaboration services from SSND can be either graphical
user interfaces (GUI), asking for data e.g. to visualize strategic networks, or other
network nodes sending demands to suppliers. The collaboration in the SSND framework is
executed by the collaboration component. Given the complexity of collaboration in
inter-enterprise systems, a detailed description of the collaboration component is
given in chapter 4.
Fig. 3. Semantic model of inter-organizational collaboration



4 Collaboration Component 

Successful inter-organizational collaboration is based on a suitable communication 
infrastructure that allows the coordination of tasks to achieve a given goal. Therefore 
a reusable collaboration component needs to support all basic functionality necessary 
in different information exchange scenarios. In order to illustrate the significance of 
the generic collaboration component for information exchange, a generic semantic 
model for inter-organizational collaboration has been developed. Fig. 3 shows a frag- 
ment of the general model to illustrate the collaboration problems in the domain of 
strategic supply network development. 

The basic principle is the participation of actors in relationships in order to pursue 
a goal (see Fig. 3). This could be two or more business partners that want to ex- 
change strategic supply network information in order to plan their demands, but also 
software applications sending data to each other in a distributed environment. In order 
to achieve the goal defined in such a relationship activities need to be performed. 
These activities are coordinated by a process that defines in what order and under 
which circumstances they are executed to reach the desired final state. As indicated in 
Fig. 3 activities can consist of other activities so that even complex operations or 
process chains can be represented. Typical activities are: send, receive, transform, and 
define. Note that even a simple activity like send can contain more than one subtask
that can itself be described as an activity.

Activities always use or create activity objects like demand, address, or process
description that contain relevant information. Each of these activity objects can be
represented by an arbitrarily complex data structure. The syntax of a data structure
describes the rules for building correct expressions from a set of characters and for
arranging these expressions to obtain the data structure itself. Every data structure
is associated with a corresponding type, i.e. if a demand document and its
corresponding demand type are defined, then every document with the same data
structure is considered to be a demand. Ideally, two separately defined types with the
same name should have the same information content, but that depends on the actors
that agreed upon the definition. The information content is also referred to as the
semantics of a data structure [22]. So the semantics provides the meaning of a data
structure based on its type. In an inter-organizational scenario with many
participants it is desirable to agree upon a data description standard to avoid
problems with differently defined types. This cannot always be achieved; in this case
data structures have to be syntactically and/or semantically transformed in order to
be understood by the receiver.

With activities and activity objects it is possible to model actions like send de- 
mand, process answer or get self-information. Activities are executed in a context that 
determines additional properties like quality of service, security or transactional
behavior. A particular context can require further activities to be performed in
order to meet its criteria; e.g. if an activity is executed in a secure context, data
needs to be encrypted with a certain algorithm, for which a key must be exchanged
between business partners in advance.

The activity information exchange (see center of Fig. 3) plays a special role in an
inter-organizational setting. It represents the communication functionality that
allows the activities of participating actors to be coordinated by passing the
activity objects needed between them. For example, a supplier must receive a
description of its customer’s demand before it can check order information for this
particular demand. A so-called protocol defines the details of an information
exchange, such as the message sequence and the types of exchanged information.
Protocols need to be mapped to the processes that coordinate the activities within
participating organizations, i.e. information that is contained in an incoming message
must be extracted and routed to the appropriate application (actor) in order to be
used in an activity at the time at which the process flow schedules this activity.
Since every organization has insight only into its own processes and does not want to
show any of its process details to its business partners,
corresponding public processes are defined to model inter-organizational aspects of 
the overall process. A public process describes the sequence of sending and receiving 
information types from one organization’s point of view [29, 7]. As this sequence is 
determined by the previously mentioned protocols, public processes provide an inter- 
face between protocols and private processes. 

When exchanging information, messages are used to transport the actual activity 
object. They contain communication specific meta-information like sender and re- 
ceiver address. The activity information exchange can be decomposed into three spe- 
cific activities send message, transport message, and receive message that are subject 
to a specific underlying communication model which then can be implemented by a 
concrete channel. Depending on the communication model properties, these activities 
can have different functionalities. The exchange flow defines how messages are
transported from sender to receiver (directly or indirectly over one or more
intermediaries). Data exchange specifies whether data is exchanged physically or only
a reference to data stored somewhere else is used. Addressing describes how the
physical address of the message receiver is determined. Addressing can be static, i.e.
the address is known in advance, or it can be dynamic, i.e. it is determined at
runtime [3]. The response behavior defines the degree of coupling between the actors.
Messages that are exchanged synchronously between actors form a tightly coupled
conversation, whereas asynchronous communication allows a looser coupling between the
actors.
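
The four properties of a communication model named above can be written down as a small data type. The sketch below is illustrative only; the class and member names are assumptions chosen to mirror the terms in the text and do not come from the framework.

```python
from dataclasses import dataclass
from enum import Enum


class ExchangeFlow(Enum):
    DIRECT = "direct"        # sender to receiver without intermediaries
    INDIRECT = "indirect"    # via one or more intermediaries


class DataExchange(Enum):
    BY_VALUE = "by value"            # data is physically exchanged
    BY_REFERENCE = "by reference"    # only a reference to the data is sent


class Addressing(Enum):
    STATIC = "static"      # address known in advance
    DYNAMIC = "dynamic"    # address determined at runtime


class ResponseBehavior(Enum):
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


@dataclass(frozen=True)
class CommunicationModel:
    """Hypothetical description of one communication model."""
    exchange_flow: ExchangeFlow
    data_exchange: DataExchange
    addressing: Addressing
    response: ResponseBehavior

    def is_tightly_coupled(self) -> bool:
        # Synchronous exchange forms a tightly coupled conversation.
        return self.response is ResponseBehavior.SYNCHRONOUS


# Example: a loosely coupled, store-and-forward style of communication.
mail_like = CommunicationModel(ExchangeFlow.INDIRECT, DataExchange.BY_VALUE,
                               Addressing.STATIC, ResponseBehavior.ASYNCHRONOUS)
```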

By looking at the inter-organizational model based on information exchange explained
above, one can now identify problems that could arise because of conflicting
assumptions in a collaboration scenario and which need to be handled by a
collaboration component. According to [12], the conflicting assumptions are the
following:

Assumptions about incoming and outgoing data: As already mentioned, informa- 
tion represented as data structure is exchanged between actors. Problems can arise if 
either the syntax or the semantics of a data structure is not as expected. To change the 
syntax of a data structure its information content has to be transformed in a way that it 
is expressed according to the target syntax rules [33]. Semantic transformation must
handle different terminologies [11] and different information content. Therefore, a
collaboration component has to provide a transformation engine. This is necessary not
only in communication with other organizations but also internally when data needs to
be read from legacy applications. So-called adapters have to be provided to integrate
data from all known data sources. These adapters are also based on syntactic and
semantic transformations.
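
To make the distinction between the two kinds of transformation concrete, the following sketch chains a syntactic and a semantic step into one adapter. It is a hedged illustration: the source syntax `key=value;key=value`, the JSON target syntax, the vocabulary table and all function names are assumptions, and a record containing an item without `=` is not handled.

```python
import json
from typing import Dict


def syntactic_transform(record: str) -> Dict[str, str]:
    """Parse the assumed source syntax 'key=value;key=value' into a dict."""
    pairs = (item.split("=", 1) for item in record.split(";") if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}


def semantic_transform(data: Dict[str, str],
                       vocabulary: Dict[str, str]) -> Dict[str, str]:
    """Rename the sender's terms to the receiver's terminology."""
    return {vocabulary.get(key, key): value for key, value in data.items()}


def adapt(record: str, vocabulary: Dict[str, str]) -> str:
    """Adapter: source record in, target-syntax (JSON) document out."""
    data = semantic_transform(syntactic_transform(record), vocabulary)
    return json.dumps(data, sort_keys=True)


# Hypothetical terminology mapping between a legacy system and the receiver.
VOCABULARY = {"qty": "quantity", "part": "material"}
document = adapt("qty=5; part=bolt", VOCABULARY)
```

Keeping the two steps separate means that a new data source with a known vocabulary only needs a new parser, while a partner with different terminology only needs a new vocabulary table.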

Assumptions about patterns of interaction characterized by protocols: As private 
processes, especially those concerning business logic, are often already defined before 
inter-organizational collaboration is initiated, problems with the expected message 
sequence of the information exchange could arise. Protocols concerning the connec- 
tivity of applications are often standardized and do not affect higher-level protocols. 
A flexible collaboration component should therefore support different concrete
channels that implement these standards, i.e. it must be capable of sending and
receiving data over these channels (e.g. HTTP, SMTP, SOAP). In contrast, business
protocols do strongly affect private processes. So if the message sequence does not
fit the private process and necessary information is not available in time, the
process cannot be executed although the information is available on the sender’s side
[11]. Therefore a wrapper is needed to generate or delete certain messages according
to the target process.
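
One possible reading of such a wrapper is sketched below. It assumes, purely for illustration, that messages are dictionaries with a `type` field, that the private process expects a fixed sequence of message types, and that missing messages can be produced by registered generator functions; none of these names or structures are taken from the framework.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]


def align_messages(incoming: List[Message],
                   expected_types: List[str],
                   generators: Dict[str, Callable[[], Message]]) -> List[Message]:
    """Reorder, drop and generate messages to fit the target process."""
    # Buffer incoming messages by type; a later message of the same type
    # replaces an earlier one in this simplified sketch.
    buffered = {str(message["type"]): message for message in incoming}
    aligned: List[Message] = []
    for message_type in expected_types:
        if message_type in buffered:
            aligned.append(buffered[message_type])
        elif message_type in generators:
            # The partner's protocol does not send this message: generate it.
            aligned.append(generators[message_type]())
        else:
            raise LookupError(f"no message of type {message_type!r} available")
    # Messages whose type the target process does not expect are deleted.
    return aligned


received = [{"type": "ack"}, {"type": "demand", "body": "pump"}]
result = align_messages(
    received,
    expected_types=["demand", "confirmation"],
    generators={"confirmation": lambda: {"type": "confirmation", "body": None}},
)
```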

Assumptions about the topology of the system communication and about the pres- 
ence of components or connectors: Even if a collaboration component implements 
various communication channels it has to know which one to use when sending in- 
formation to a particular receiver. This information, including the actual address, must 
be available for each potential receiver. If some activity needs a special context, this 
must be recognized by all participants; for example, if encrypted information is sent 
then the receiver must be able to decrypt this information. Sometimes additional pro- 
tocols must be included into the existing protocol infrastructure to meet a certain 
context agreement (e.g. acknowledgment messages for reliable communication). 
Again these protocols have to be supported on both sides. They can also be standard- 
ized protocols that every participant integrates into its existing protocol infrastructure, 
e.g. transaction protocols like parts of BTP [20] or WS-Transaction [5], or protocols 
that can not be easily integrated without affecting private processes. In this case prob- 
lems can arise as mentioned above when discussing sequence mismatches in business 
protocols.

Fig. 4. Detailed view of the collaboration manager component



Regarding the functionality needed for inter-organizational collaboration in hetero- 
geneous environments in order to solve collaboration conflicts - as already introduced 
- the internal view of the collaboration manager component has been developed (see 
Fig. 4) and the interfaces provided by the collaboration manager specified in detail. 

The Actor & Protocol Management component manages all information concern- 
ing potential communication partners and all corresponding collaboration specific 
data e.g. supported channels, supported protocols (communication, context, and busi- 
ness logic), and supported data formats. Depending on this information the collabora- 
tion manager component can be configured for each partner. The Communication 
component for example can choose the right send and receive functionality based on a 
selected channel. The same is true for the Data Transformation component that plugs 
in appropriate transformation adapters (not shown in Fig. 4). If necessary the Process 
Binding component implements the wrapper functionality to solve problems related to 
the message sequence as discussed above. It also integrates additional protocols 
caused by a specific context into the process. 



5 Prototype Implementation of the Component Framework 

As a proof of concept, a first prototype implementation of the SSND component
framework has been developed, and an example supply network for the development
of an electronic motor, ranging from the OEM to tier-5, has been built. For this
purpose the SSND prototype has been installed on each node of the supply network,
including the OEM node. This chapter gives a quick overview of the SSND prototype
and some implementation details focusing on the collaboration component.

An example view of a dynamic modeled supply network for the production of an 
electronic motor executed by the SSND system is shown in Fig. 5. Only a selected
area of the whole supply network is shown. 

Fig. 5. Dynamic modeled supply network for the production of an electronic motor

The rectangles represent the different companies of the supply network, visualized
with important information about the node contributing to the supply network of the
electronic motor. Relevant information for the requestor about the suppliers is e.g.
the name of the company, the material group the supplier is producing, the minimum
volume that must be ordered and the capacity per day. The companies are visualized
in the SSND prototype in different colors, differentiating between a) suppliers able
to deliver the product and amount requested, b) suppliers not answering the demand
sent, c) suppliers which are not online or where a communication problem exists and
d) suppliers which do not have enough capacity for producing the required product, or
where the order volume required by the client is too low. The tool provides different
modes of visualizing the network - adding more detailed information to the nodes,
showing just parts of the network, etc. - in order to support the requestors with all
necessary information for developing their strategic supply networks. Giving a
detailed description of the tool would go beyond the scope of this paper.

To implement the pull technology of self modeling demand driven networks in 
SSND, the concept of asynchronous communication (as introduced in chapter 4) has 
been implemented in the prototype application and has been combined with the Web 
Service technology responsible for the exchange of messages between components 
distributed on different nodes. Web services are a new promising paradigm for the 
development of modular applications accessible over the Web and running on a vari- 
ety of platforms. 

The Web service standards are SOAP [30] - supporting platform independency - 
WSDL [31] - specifying the interfaces and services offered - and UDDI [27] - used 
for the publication of the Web services offered by a specific company. All standards 
are based on the Extensible Markup Language (XML) [6]. An overview of standards
and related technologies for Web services is given in [24]. The communication com- 
ponent therefore implements the Web Service interface (called W3SSND in the 
SSND prototype) in order to provide access to the component through Web Services. 
The services offered are e.g. process requests, process replies providing different 
addresses for the different message types and exchanging the data not by reference 
but by including it in the messages (see communication model properties in Sect. 4).

The Web Service interface, the contact information and the products offered by a
specific company are described in a WSDL document and made publicly available 
over the directory service (UDDI). Therefore new companies can be identified 
through the UDDI directory service while searching for potential suppliers of a spe- 
cific product and addressing is dynamic being determined at runtime. Additionally the 
services provided by each company are globally available and accessible by any other 
node in the SSND network, allowing the exchange of messages between network 
nodes. The exchange of messages using the Web Service technology is shown in 
Fig. 6. 

The ovals in the picture represent company nodes in the network of SSND. Each 
company has the SSND prototype installed, containing all components shown in 
Fig. 2. Every node offers therefore services provided by the collaboration component 
as Web Service, which are made publicly available over the UDDI directory service. 
A company defines sub-demands required for a specific product by a bill of material 
explosion, e.g. the sub-demands required from company X for the product A are B, C 
and D. For each demand and sub-demand a process is started on the network node 
waiting for a response while communicating the sub-demands by messages to the 
companies. Company X therefore calls the Web service process request of the
companies S, T, U and Y. Since the companies S, T and U do not need any material from
other companies, they send back the information about their ability to deliver by
calling the Web service process reply of company X. Each response is handled by the
corresponding sub-demand process. Company Y instead identifies the products needed
from its own suppliers by a bill of material explosion, starting sub-demand processes
E, F and G and communicating the sub-demands to its own suppliers.

After receiving the information about delivery abilities, company Y aggregates
it and returns its own supply network for that specific demand to company X.
Company X, aggregating the information from all suppliers, is then able
e.g. to visualize the suppliers' network (top-right in Fig. 6) for further evaluation.
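
The recursive exchange of demands and replies described above can be sketched in a
few lines of Python. This is only an illustrative sketch and not the SSND
implementation: the class Node, its fields, the assignment of the sub-demands B, C
and D to the suppliers S, T, U and Y, and the supplier names Y1 to Y3 are
assumptions made for the example, and the asynchronous Web Service calls process
request and process reply are collapsed into one synchronous method call.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A company node of the network (hypothetical structure, for illustration)."""
    name: str
    can_deliver: bool = True
    # Bill of material: sub-demand -> suppliers that are asked for it.
    bill_of_material: dict[str, list["Node"]] = field(default_factory=dict)

    def process_request(self, product: str) -> dict:
        """Return this node's aggregated supply network for `product`."""
        network = {"company": self.name, "product": product,
                   "can_deliver": self.can_deliver, "suppliers": []}
        # Bill of material explosion: one sub-demand per required component.
        for sub_demand, suppliers in self.bill_of_material.items():
            for supplier in suppliers:
                # Each reply is the supplier's own aggregated sub-network.
                network["suppliers"].append(supplier.process_request(sub_demand))
        return network


def companies(network: dict) -> list[str]:
    """List all company names in an aggregated network, depth first."""
    names = [network["company"]]
    for sub_network in network["suppliers"]:
        names.extend(companies(sub_network))
    return names


# Assumed example topology: S, T and U need no further material, Y does.
s, t, u = Node("S"), Node("T"), Node("U")
y = Node("Y", bill_of_material={"E": [Node("Y1")], "F": [Node("Y2")],
                                "G": [Node("Y3")]})
x = Node("X", bill_of_material={"B": [s], "C": [t], "D": [u, y]})

supply_network = x.process_request("A")
print(companies(supply_network))
```

In the prototype each reply arrives as a separate message and is handled by the
waiting sub-demand process; the nested dictionary above only mirrors the aggregated
result that company X can visualize.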

Regarding the exchange flow in the distributed system - as introduced by the 
communication model properties in chapter 4 - messages are transported directly 
from the sender to the receiver. As defined by [23], the network of independent sys- 
tems that constitute the strategic supply network appears to the user of a single node 
inside that network as a single system and therefore represents a distributed system. It
is an open peer group of loosely coupled systems. Every node is autonomous and
implements functions for its own view of the network; no hierarchy exists. Regarding
the different roles a node can play - e.g. being the producer sending a demand to 
known suppliers or being a supplier receiving demands from a client and either
sending the related sub-demands to the known suppliers or sending the answer back to
the client - each node in its context of the application is able to store data and to
send and receive demands or answers from other nodes in the network. The communication
takes place between peers without guarantee that a node is always online and
contributing to the network. Regarding all those aspects mentioned, the application for
strategic supply network development can be considered as a peer-to-peer application
having all main features - client-server functionality, direct communication between 
peers and autonomy of the single nodes - of today’s peer-to-peer applications as de- 
fined by [4] and [19]. The only difference of the SSND system to today’s understand- 
ing of peer-to-peer applications is the initialization of new nodes in such a peer-to- 
peer network. In a peer-to-peer network a new node can contact any other node of the 
network in order to become a member. In the SSND network a new node always has
to contact a specific node, namely the directory service. The advantage of such a
solution is that the companies building new strategic networks can request
information about new companies from the directory service and alternatively send a
demand to a new member in tier-1 in addition to the known nodes.



6 Conclusion and Future Work 

This paper presents a component framework for inter-organizational communication 
in the domain of strategic supply network development, based on the concept of self 
modeling demand driven networks. The component model forms the core of the
framework. The business components provide the main business logic for the development
of strategic supply networks, whereas the system components are handling data per- 
sistency and problems of heterogeneity that come up in inter-organizational commu- 
nication and collaboration. A first prototype implementation of the framework has 
been introduced, focusing on implementation details regarding collaboration between 
network nodes. 

While the first prototype application focused on the feasibility of the concept of
self modeling demand driven networks and on the exchange of messages within a rather
homogeneous environment in which all nodes have the same system installed, further
extensions of the prototype need to implement the collaboration concepts introduced
in the component framework to solve the problems of heterogeneity. Additionally, as
product related information used in the bill of material explosion constitutes the
basis of the concept of self modeling networks, interfaces to existing PDM, ERP and
PPS systems have to be defined to enhance the applicability of the system.



Abstract. Knowledge discovery in databases (KDD) plays an impor-
tant role in decision-making tasks by supporting end users both in ex- 
ploring and understanding of very large datasets and in building pre- 
dictive models with validity over unseen data. KDD is an ad-hoc, iter- 
ative process comprising tasks that range from data understanding and 
preparation to model building and deployment. Support for KDD should, 
therefore, be founded on a closure property, i.e., the ability to compose 
tasks seamlessly by taking the output of a task as the input of another. 
Despite some recent progress, KDD is still not as conveniently supported 
as end users have reason to expect due to three major problems: (1) 
lack of task compositionality, (2) undue dependency on user expertise, 
and (3) lack of generality. This paper contributes to ameliorating these
problems by proposing an abstract algebra for KDD, called K-algebra, 
whose underlying data model and primitive operations accommodate a 
wide range of KDD tasks. Such an algebra is a necessary step towards 
the development of optimisation techniques and efficient evaluation that 
would, in turn, pave the way for the development of declarative, surface 
KDD languages without which end-user support will remain less than 
convenient, thereby damaging the prospects for mainstream acceptance 
of KDD technology. 



1 Introduction 

In spite of great interest and intense research activity, the field of knowledge dis- 
covery in databases (KDD) is still far from benefiting from unified foundations. 
Research focus has been patchy by and large. Efforts are all too often devoted 
to devising new, or extending existing, algorithms that are very specifically tai- 
lored to very specific contexts, and hence not as widely applicable as one would 
hope, given the large variety of application scenarios in which KDD could have 
a significant impact. 

At present, there is no foundational framework that unifies the different rep- 
resentations in use, let alone a well-founded set of generic operations that both 
cohere, when taken together, and are flexible enough to express a significant
number of complex KDD tasks in a convenient way. The lack of such foundations
is a significant impediment in the way of mainstream acceptance, and of wider
deployability, of KDD technologies. This is because without such a foundational
framework it is very difficult to develop declarative, surface KDD languages 
without which end-user support may always remain less than convenient. 

This is very much reflected in the difficulties data analysts face in interacting 
with KDD systems. KDD is an ad-hoc, iterative process comprising tasks that 
range from data understanding and preparation to model building and deploy- 
ment. The typical flow of work in KDD projects requires end-users (i.e., data 
analysts and knowledge engineers) to operate a KDD suite that, particularly 
with respect to the data mining phase, acts as a front-end to a collection of algo- 
rithms. Unfortunately, each algorithm tends to be very specialised, in the sense 
that it produces optimal outcomes only under very narrow assumptions and in 
very specific circumstances. Moreover, different algorithms tend to depend on 
specific representations of their inputs and outputs, and this often requires the 
end user to carry out representational mappings in order for tasks to compose in
a flow of work. More importantly, if all one has is a suite of algorithms, then end 
users bear the ultimate responsibility for overall success, since they must choose 
the most appropriate algorithm at each and every step of the KDD process. Bad 
choices may well delay or derail the process. Given that the KDD literature has 
given rise to a myriad different, very specific algorithms for each phase of the 
KDD process, and given that sometimes the differences are subtle on the surface
but with significant import in terms of effectiveness and efficiency, the choices 
faced by end users could be bewildering. Thus, it can be seen that the state of 
the art in KDD still suffers from some significant shortcomings. The main issues 
identified above are: 

1. Lack of compositionality, insofar as contemporary KDD suites fail to achieve 
seamless composition of tasks. Inputs and outputs have different representations
(e.g., different file formats) and the need to map across them hinders
the transparency of the process. Moreover, due in part to type mismatches, 
there is a lack of orthogonality in the way the tasks are composed. For exam- 
ple, while it is possible to apply selection to (i.e., take a subset of) a dataset, 
the same generally does not apply to a decision tree (i.e., choose a subset
of branches), although it would be natural to do so. The challenge here
is to find a unified representation for KDD inputs and outputs. 

2. Dependency on user expertise, insofar as there is a large number of candidate 
algorithms at almost every step, for almost every task, and each in turn has 
several parameters which require expertise in being set if good results are 
to be obtained. Often these choices affect considerably not only how satis- 
factory the results are but also how efficiently they are obtained. Typically, 
these decisions are far too complex to be undertaken by casual users. The 
challenge here is to find specification mechanisms that are amenable to for- 
mal manipulation by an optimisation algorithm so as to protect the end user 
as much as possible from inadvertent choices. 

3. Lack of generality, insofar as it would be much more convenient if the body 
of algorithms proposed in the literature were better structured and if abstract
generalisations were available that could help the selection of concrete
instances. It has been observed empirically that, despite the large number 
of different algorithms proposed for similar tasks, they do seem to be based 
on the same basic principles, differing in very particular respects that make 
them more applicable to certain contexts than others. The challenge here is 
to find abstractions that are comprehensive enough to unify the concrete, 
specialist algorithms that comprise the bulk of the KDD literature. This 
would allow the basic principles to be more vividly understood and make 
it possible to define, and reason about, a wide range of KDD tasks more 
systematically, and even mechanically. 

A unifying formal framework for KDD task composition is a possible, concerted
response to the challenges identified. As a concrete candidate for such a
formal framework this paper contributes the K-algebra, an abstract algebra
for KDD whose underlying data model and primitive operations accommodate
a wide range of KDD tasks. The K-algebra is a database algebra, i.e., its
underlying concern is with flow throughput in situations where the volume of input
at the leaves of the computation is significant enough to be the dominant factor
in the efficiency of the latter. It must be stressed that the K-algebra is an
abstract algebra, i.e., it is proposed as a candidate intermediate representation 
between surface languages and executable plans. It is best understood, therefore, 
as the proposal of an abstract machine that paves the way for systems to take 
decisions on behalf of the users as to how complex KDD tasks are configured 
and composed, similarly to what current query optimisers deliver for database 
systems. It is not a proposal for dealing directly with concrete effectiveness and 
efficiency concerns: this is what the bulk of the KDD literature is already doing 
so commendably. 

This paper, therefore, takes the view that defining the foundations for a KDD 
language that can be subject to optimisation is more likely to succeed than con- 
tributing yet another too-specific tool to an already very large tool set. The mo- 
tivation for our approach stems from the observation that database technology 
was once limited to a comparable motley collection of poorly related techniques, 
and the same consequences were noticed then. A historical breakthrough was 
the development, by Codd, of a unifying formal foundation as an abstract algebra,
which was shown to be expressive enough to express complex combinations
of tasks that were mappable to concrete algorithms and techniques. The
subsequent introduction of declarative languages empowered end users to query 
databases in an ad-hoc manner, expanding greatly the possibilities for query 
retrieval. Ultimately, this was all made concrete with the development of query 
optimisers, which can take advantage of algebraic properties to seek efficient 
plans to evaluate the queries over databases. This algebraic approach proved 
successful for other database models and languages developed subsequently, such 
as object-oriented [4] and spatial [8] data, and, more recently, XML [11]
and streams [16], to name a few. 

Other researchers have had similar motivations, and it seemed natural to 
them to consider expressing KDD tasks in a standard database language, for
which a robust concretely implemented counterpart exists. However, significant
technical impediments stand in the way. For example, standard database languages
only have limited capability (and often none at all) to express recursion or
iteration, which seem essential in KDD. Indeed, it is shown by Sarawagi et al. [15]
that the implementation in SQL of Apriori, a classical scalable algorithm for
association rule discovery, falls short in terms of performance. A similar algorithm
is expressed in terms of an extended relational algebra by Ceri et al. [13]. However,
it is not clear whether this approach would apply to other more challenging KDD
tasks.

To illustrate the K-algebra, the paper includes a specification of a top-down
decision-tree induction process in abstract-algebraic form. To the best of our
knowledge, this is the first such formulation. In order to have a concrete example,
the next subsections make use of the simple dataset [14] in Fig. 1. Briefly, it
records weather conditions over a time period (say, days) and labels each instance
with the information as to whether a particular sport was played on that day.

The general goal is to use such a labelled dataset to construct a decision tree
that, based on the values of certain of the attributes (e.g., outlook and windy), can
determine the value of the desired attribute (called the class) play. In other words,
the decision tree hypothesises a label given a non-labelled instance. Fig. 2 depicts
the decision tree that Quinlan's ID3 [14] outputs given the dataset in Fig. 1.

The remainder of this paper is structured as follows. Section 2 describes the
K-algebra informally. Section 3 shows how it can express top-down decision-tree
induction. Section 4 briefly discusses related work. Section 5 draws some conclusions.

Fig. 1. The Weather Dataset

Fig. 2. Example Decision Tree



2 The K-Algebra: An Abstract Algebra for KDD

2.1 The Key Underlying Concepts 

The K-algebra is based on an abstract view of the KDD process as consisting of
arbitrarily complex compositions of steps that expand, compress, evaluate,
accumulate, filter out and combine, in an iterative fashion, collections of summaries
of data. The italicised terms in the previous sentence denote the key underlying
concepts of the K-algebra. They underpin the expression of a wide range of KDD
tasks.

The K-algebra has two carriers: K-cubes, which are annotated multi-dimensi- 
onal structures to represent individual summaries; and K-cube-lists, which, un- 
surprisingly, collect K-cubes into lists, in order to support iterative collection- 
at-a-time processing of K-cubes. 

The typical strategy for specifying KDD tasks in the K-algebra is to obtain 
a most-detailed K-cube from which further candidates are derived, directly or 
indirectly. At each iteration, this base K-cube is often used to give rise to derived 
ones through the application of operations that either expand or compress a K- 
cube into one or more K-cubes. By ‘expanding’ (resp., ‘compressing’) is meant 
applying operations that, given a K-cube, increase (resp., decrease) its dimen- 
sionality or its cardinality, to yield a (usually slightly) different summarised 
perspective on the data. These derived summaries are evaluated, as a result of 
which they are annotated with one or more measures of interestingness¹.

The interesting ones, i.e., those that meet or exceed a given threshold, are
accumulated, whereas the remaining ones are either refined further or filtered 
out. Alternatively, the K-cubes in the current accumulation may be combined 
(assembled back) into a single K-cube for further exploration. 



2.2 The K-Algebra Carriers

The fundamental carrier of the K-algebra is a set of structures referred to 
as K-cubes. It is designed to unify the inputs and outputs of a range of 
KDD tasks. Informally, a K-cube is a hypercube to which is associated a 
vector of attributes called annotations. A hypercube is an aggregated rep- 
resentation of extensional data in a multi-dimensional space, in which at- 
tributes called dimensions play the roles of axes. The (discrete) dimen- 
sional domains divide the space into units called cells. Attributes called mea- 
sures represent properties, usually aggregated values, of the data inside cells. 
For example, a K-cube representing a contingency table from the weather dataset is
depicted in Fig. 3 in its flat representation. The attributes outlook and play are
dimensions. Absolute frequency, denoted by af, is an aggregate measure. Aggregate
measures are usually associated with functions on sets of values such as Sum, Max
and Min. In contrast, rf is a scalar measure denoting the relative frequency obtained
by a function on scalars (in this case, rf = af/aftotal, where aftotal = 14). Each
row in the (flat representation of a) K-cube corresponds to a cell.

Fig. 3. Example K-Cube

¹ Note that, here, the concept of interestingness is inspired by, but more loosely
construed than is the case in, [3].

Annotations decorate the summarised information (represented as a hypercube) with
interestingness values. For example, they might range from grand totals and global
measures, such as the total frequency aftotal in Fig. 3, to more specific measures
such as the predictive accuracy of a decision tree. As is shown later, in the
K-algebra, annotations result from the computation of measures. K-cubes are collected
in K-cube-lists, which are the second carrier of the K-algebra. There is no
particular significance in using lists as the concrete collection type: they are
chosen because they simplify the presentation of those K-algebraic functions
that operate on collections of K-cubes.
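
To make the carrier more tangible, the following Python sketch builds a K-cube like
the contingency table described above, in its flat representation. It is a sketch
under stated assumptions rather than the formal definition: the class KCube, the
function contingency_kcube and the field names are hypothetical, and the
(outlook, play) pairs are assumed to match the commonly published 14-row version of
the weather dataset. Cells with zero frequency are simply absent.

```python
from collections import Counter
from dataclasses import dataclass, field

# (outlook, play) pairs; assumed to follow the commonly published weather dataset.
WEATHER = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]


@dataclass
class KCube:
    """Hypothetical flat K-cube: one dict per cell, plus annotations."""
    dimensions: tuple[str, ...]
    measures: tuple[str, ...]
    cells: list[dict]
    annotations: dict = field(default_factory=dict)


def contingency_kcube(rows, dimensions):
    """Summarise rows into cells with an aggregate (af) and a scalar (rf) measure."""
    counts = Counter(rows)            # af: aggregate measure (a count per cell)
    total = sum(counts.values())      # aftotal: kept as an annotation
    cells = []
    for key, af in sorted(counts.items()):
        cell = dict(zip(dimensions, key))
        cell["af"] = af
        cell["rf"] = af / total       # rf = af / aftotal: scalar measure
        cells.append(cell)
    return KCube(tuple(dimensions), ("af", "rf"), cells, {"aftotal": total})


cube = contingency_kcube(WEATHER, ("outlook", "play"))
for cell in cube.cells:
    print(cell)
print(cube.annotations)
```

Keeping aftotal as an annotation rather than as a column mirrors the distinction
drawn above between measures, which describe cells, and annotations, which describe
the K-cube as a whole.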



2.3 The K-Algebraic Operations

Recall that the K-algebra was said to be based on an abstract view of the KDD 
process as consisting of arbitrarily complex compositions of steps that expand, 
compress, evaluate, accumulate, filter out and combine, in an iterative fashion,
collections of summaries of data. The previous subsection shed some light on
what is meant by ‘collections of summaries of data’. This subsection presents 
the operations that support the above steps and that provide the means for 
iterating over intermediate results. The full formal definition [7] is not given 
here for reasons of space. 

Figs. 4 and 5 list the signatures for the operations and organise them in 
categories. In Figs. 4 and 5, the intended domains and ranges are as follows: N 
denotes a finite segment of the natural numbers. K is instantiated by a single 
K-cube, L by a K-cube-list. For example, the signatures of the operations in 
Fig. 4(g) reflect the conversion of a K-cube into a K-cube-list. D, M and A
are, respectively, instantiated by a set of names of dimensions, measures and 
annotations. For example, generalise (Fig. 4(a)) requires the identification of the 
dimensions to be removed from a K-cube. In turn, G denotes generators that as- 
sign the result of an expression, usually involving scalar and aggregate elements, 
to an attribute. This is illustrated by the signature of the operation
add-scalar-measure in Fig. 4(c), where G represents the association of a new measure with
an expression as a composition of scalar functions and values. F is instanti- 
ated by an aggregation function, such as Max or Min, applied on a particular 
measure. V and E apply to higher-order functions only. The former denotes a 
binding variable to K-cube elements in the lists being processed, whereas the 
latter represents an algebraic expression (over the binding variables) to be ap- 
plied to a K-cube-list. Finally, R is concerned only with bridge operations: it 
is instantiated by a relation in the extensional layer, as explained later in this 
subsection. 

To see the intended structure in Fig. 4, note that Fig. 4(a) lists the operations
on dimensions, Fig. 4(b) those on annotations, Fig. 4(c) those on measures,
Fig. 4(d) those on attributes, and, finally, Fig. 4(e) those on cells. As expected,
the basic design is that one can add, remove, convert and rename dimensions
(e.g., specialise, generalise), annotations and measures (e.g.,
add-aggregate-measure, add-scalar-measure, copy-mea-as-ann). The first four categories comprise

specialise : K × K × 2^… × 2^… → K
generalise : K × 2^… × 2^… → K
rename-dimension : K × 2^… → K

(a) Manipulating Dimensions

remove-annotation : K × 2^… → K
rename-annotation : K × 2^… → K

(b) Manipulating Annotations

(c) Manipulating Measures

(d) Copying Attributes

select : K × P …
union
difference
natural-join
cartesian-product : K × K …

(e) Manipulating Cells

(f) Higher-Order

partition-into-n : K × N → L
partition-by-dim : K × 2^… → L
gen-subsets : K × 2^… → L
gen-supersets : K × K × 2^… × 2^… → L

(g) Expanding a K-Cube into a K-Cube-List

Fig. 4. K-Algebraic Operations (1)

relation-to-kcube : R × 2^… × 2^… → K
kcube-to-kcube-list : K → L
kcube-to-relation : K → R
kcube-list-to-kcube : L → K

(a) Bridging Between Layers

Fig. 5. K-Algebraic Operations (2)




operations that reshape a K-cube without direct reference to its cells. These
operations make a K-cube grow or shrink in dimensionality, or be evaluated by
more or fewer measures, or be annotated in more or fewer ways. The operations in
Fig. 4(d) provide means to assign a different role to a particular attribute in a
K-cube. For example, copy-dim-as-mea adds a new measure to a K-cube, whose
values are replicated from those of an existing dimension. In contrast, the last
category (in Fig. 4(e)) comprises operations that make direct reference to the
cells of a K-cube, and this explains the analogies with classical database-algebraic
operations (e.g., select, select-n, and natural-join). This is indicative of the fact
that the K-algebra can be seen to operate on (possibly aggregated) data (as in
standard database algebras) but, in addition to that, it wraps that data with
metadata of interest for the identification of descriptive and predictive models
of the data.
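
As an illustration of an operation that shrinks a K-cube in dimensionality, the
following Python sketch removes named dimensions from a flat K-cube. It rests on an
assumption, since the formal definition is not given here: the removal is taken to
behave like a roll-up that re-aggregates the measure af with Sum over the remaining
dimensions. The function name follows the operation generalise, but its parameters
and the cell layout (one dict per cell, with values assumed from the common weather
dataset) are hypothetical.

```python
from collections import defaultdict

# Flat K-cube over the dimensions outlook and play, with the measure af.
CELLS = [
    {"outlook": "overcast", "play": "yes", "af": 4},
    {"outlook": "rainy", "play": "no", "af": 2},
    {"outlook": "rainy", "play": "yes", "af": 3},
    {"outlook": "sunny", "play": "no", "af": 3},
    {"outlook": "sunny", "play": "yes", "af": 2},
]


def generalise(cells, dimensions, removed, measure="af"):
    """Drop the `removed` dimensions and re-aggregate `measure` with Sum (assumed)."""
    kept = [d for d in dimensions if d not in removed]
    totals = defaultdict(int)
    for cell in cells:
        # Cells that agree on all kept dimensions collapse into one cell.
        totals[tuple(cell[d] for d in kept)] += cell[measure]
    new_cells = [{**dict(zip(kept, key)), measure: value}
                 for key, value in totals.items()]
    return kept, new_cells


kept, rolled_up = generalise(CELLS, ["outlook", "play"], {"play"})
print(kept)
for cell in rolled_up:
    print(cell)
```

A scalar measure such as rf would have to be recomputed from the new af values
after the roll-up rather than summed blindly, which is one reason for treating
aggregate and scalar measures separately.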

The other carrier structure employed in the K-algebra is the K-cube-list,
which collects K-cubes in (possibly nested) lists. To see the intended structure,
note firstly that Fig. 4(f) lists operations that either iterate over a K-cube-list
or are higher-order functions that transform a K-cube-list in some way (e.g., by
mapping, by filtering, by flattening, or by folding). Fig. 4(g) lists operations
that expand a single K-cube into a K-cube-list, either by partitioning it (e.g.,
partition-into-N, partition-by-dim) or by replacing it by its supersets or subsets
(e.g., gen-supersets, gen-subsets) in a controlled manner. All the operations
mentioned so far are intra-carrier operations. However, the K-algebra is a
layered algebra in the sense that closure is ensured by inter-carrier operations
that bridge across layers.
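
To make the list-level operations more tangible, the following is a minimal
Python sketch, not the K-algebra itself. It assumes a K-cube can be modelled as
a plain list of cells (dictionaries) and a K-cube-list as a Python list; the
names partition_by_dim and flatten are illustrative stand-ins for the
operations named above, and the counts are only example values.

```python
from functools import reduce

# Assumption: a K-cube is a list of cells; each cell is a dict of
# dimension values plus an absolute-frequency measure "af".
cube = [
    {"outlook": "sunny", "play": "no", "af": 3},
    {"outlook": "sunny", "play": "yes", "af": 2},
    {"outlook": "overcast", "play": "yes", "af": 4},
]

def partition_by_dim(kcube, dim):
    """Expand one K-cube into a K-cube-list, one K-cube per value of dim."""
    values = sorted({cell[dim] for cell in kcube})
    return [[c for c in kcube if c[dim] == v] for v in values]

def flatten(nested):
    """Remove one level of nesting from a list of K-cube-lists."""
    return [k for sub in nested for k in sub]

parts = partition_by_dim(cube, "outlook")                               # expanding
sizes = list(map(lambda k: sum(c["af"] for c in k), parts))             # mapping
mixed = list(filter(lambda k: len({c["play"] for c in k}) > 1, parts))  # filtering
total = reduce(lambda acc, k: acc + sum(c["af"] for c in k), parts, 0)  # folding
flat = flatten([parts, parts])                                          # flattening
```

Each call stays within the list carrier, which is the sense in which these
operations are intra-carrier.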

The layers are as follows. The extensional layer represents the extensional
data from which K-cubes are initially obtained and to which K-cubes are
eventually deployed. The carrier is that of the relational model; the operations
stem from a relational algebra extended with grouping and aggregation [5]. The
K-cube layer comprises those operations that manipulate K-cubes one-at-a-time.
The carrier is the K-cube. The K-cube-list layer provides higher-order
operations that manipulate lists of K-cubes. The carrier is the K-cube-list.
Thus, the operations in Fig. 5(a) allow values to move between layers. This is
necessary for deriving summaries from, and deploying models in, the extensional
data layer, as well as for moving from K-cubes to K-cube-lists.

The case study in Section 3 concretely illustrates all the most important 
operations introduced in Figs. 4 and 5. 

3 A Concrete Example: Decision-Tree Induction

This section applies the K-algebra to a concrete example of building an ID3- 
like classifier. Its subsections provide a quick overview of top-down decision tree 
building algorithms, an overview of the algorithmic strategy for constructing the 
classifier in K-algebraic fashion, and how the K-algebraic concepts employed cor- 
respond to the original algorithmic template for the kind of classifiers addressed. 

3.1 Top-Down Decision Tree Inducers 

A decision tree for the weather dataset was introduced in Fig. 2. Decision-tree 
models are the result of a separate-and-conquer approach and classify instances 
into classes. The attribute play constitutes the class one wants to predict based 
on the other attributes, in the given example. The branches of the tree define 
partitions of data and label the instances that traverse them with the class values 
at their leaves. For example, when applied to the weather dataset, the tree
determines that, for all instances in which outlook has the value overcast, the
class value (i.e., play) is yes.

A classical top-down decision tree algorithm is ID3 [14]. It constructs a
decision tree from a training dataset by recursively partitioning the data, using
all attributes except the class, until the partitioning is discriminating enough
with respect to the class values. Ascertaining the latter requirement relies on a
measure of the purity of a partition (such as the information gain ratio employed
in ID3-like techniques or the Gini index used in CART [1]), which essentially
reflects the variability of the class distribution. The algorithm proceeds in a
depth-first fashion, using a greedy heuristic strategy to decide on the best split
at each node of the tree. A high-level procedural specification of ID3 is shown
in Algorithm 1.



Algorithm: ID3(T, D, b, A)

    T: the tree being built
    D: the current partition of data
    b: the current branch of the tree
    A: the set of attributes to possibly split D further

    if D is pure or A = ∅ then
        Build class node for D under b
    else
        Using the information gain, determine the best attribute a ∈ A to split D
        Create a node n for a and add it to T
        Create and add to T a branch b_i for each value v in dom(a)
        for each branch b_i do
            ID3(T, D_i, b_i, A)  # where D_i is the subset of D determined by b_i



Algorithm 1. High-Level Procedural Specification of ID3 
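
As a companion to Algorithm 1, the following is a minimal Python sketch of the
same recursion. It is an illustration under stated assumptions rather than the
formulation of [14]: records are dictionaries, the tree is returned as nested
tuples instead of being added to a shared T, the chosen attribute is removed
from the candidate set before recursing, and an impure leaf is labelled with
the majority class. The function names and the toy dataset are hypothetical.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Shannon entropy of the class distribution in rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr, target):
    """Entropy reduction obtained by splitting rows on attr."""
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attrs, target):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    classes = Counter(r[target] for r in rows)
    if len(classes) == 1 or not attrs:           # D is pure or A is empty
        return classes.most_common(1)[0][0]      # build class node
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attrs if a != best]  # assumption: do not reuse best
    branches = {}
    for value in sorted({r[best] for r in rows}):   # one branch per observed value
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, remaining, target)
    return (best, branches)

# Tiny hypothetical dataset (not the weather data of the running example).
toy = [
    {"windy": "false", "humidity": "high", "play": "yes"},
    {"windy": "false", "humidity": "normal", "play": "yes"},
    {"windy": "true", "humidity": "high", "play": "no"},
    {"windy": "true", "humidity": "normal", "play": "no"},
]
tree = id3(toy, ["humidity", "windy"], "play")
print(tree)
```

On the toy data only windy separates the classes, so it is selected at the root
and both branches end in pure leaves.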

3.2 An Overview of the Algorithmic Strategy

In this subsection, the K-algebraic approach to an ID3-like classifier is
presented. The algorithmic strategy consists of three phases, viz., preparation,
iteration, and combination and deployment. Due to lack of space, the description
here is rather high-level and is only meant to illustrate how the key concepts
of the algebra are concretely applied to a classical, challenging KDD task. The
outcome is the construction of (a K-cube that is operationally equivalent to)
a decision-tree model. Some of the most important operations employed are
discussed in somewhat more detail.

Section 2 described the general algorithmic strategy for specifying KDD tasks
in the K-algebra. For the decision tree inducer, that strategy takes the following
concrete form: (1) the preparation phase constructs the most detailed K-cube
(K_base), which other computations will elaborate upon, and the initial K-cube
(K_class) to be input into the iterative part, which provides sufficient aggregate
information with respect to the dimension (play) for further evaluation of
interestingness; (2) the iterative part expands K_class into different
specialisations and partitions, evaluates interestingness (based on the Gini
index, in this case), and, at each step, accumulates those with top
interestingness (i.e., smallest Gini index); finally, (3) the combination and
deployment phase assembles the resulting accumulated K-cubes back into one that
is the desired (classification) model (K_model), and deploys it at the
extensional layer using the appropriate bridge operation.

Phase 1: Preparation. Algorithm 2 gives the K-algebraic expressions for the 
preparation phase. Its purpose and intermediate steps are as follows. 



Algorithm: K-Algebraic Expressions for Preparation

    // obtain the most detailed K-cube from the extensional data
1   K_base ← ⊞_{(outlook, humidity, windy, play), (af: Count())}(weather_R)
2   L_base ← [K_base]
    // obtain the K-cube with frequency distribution for the dimension play (the class)
3   … ← α_{(play), (af: Sum(af))}(K_base)
4   K_class ← …
5   L_class ← [K_class]

Algorithm 2. The Preparation Phase



Firstly, the most detailed K-cube K_base (Fig. 6(a)) is obtained by means of
the corresponding bridge operation relation-to-K-cube (denoted by ⊞). This is
an essential component of the algebraic strategy, as the process typically evolves
in a less-to-more-specific fashion. Specialisations of K-cubes are obtained at
each step, which requires the existence of a more detailed K-cube from which
information is recovered. For instance, take a K-cube with dimensions (outlook,
play) and a measure of absolute frequency (af). In order to specialise this K-cube
into (outlook, humidity, play), it is not possible to de-aggregate the information
from the former (i.e., af), unless it is recovered from another K-cube with the
information in non-aggregated form. This explains why operations such as
specialise and gen-supersets require two K-cubes as arguments: the one to
specialise and the one from which information is to be recovered. K_base is then
converted into a singleton list L_base with the operation K-cube-to-K-cube-list
(denoted by [...]), for the benefit of list-processing operations in the iterative
phase.

This phase also comprises the construction of the initial K-cube to be pumped
into the iterative part. In lines 3 to 4, the K-cube K_class (Fig. 6(b)) is
constructed to yield the class distribution, i.e., the absolute frequency of the
dimension play. It is obtained by the combination of an add-aggregate-measure
(denoted by α), which computes the aggregate values on partitions of cells
determined by the distinct values of play, and a remove-dimension, which removes
all the dimensions but play from the newly aggregated K-cube. The K-cube K_class
is the starting point for the iterative part, which will process specialisations
of it with a view towards minimising the Gini index for combinations of the class
with other dimensions. The initial list of K-cubes to be processed in the
iterative part is assigned to L_class, as shown in line 5.



outlook   humidity  windy  play  af
overcast  high      false  yes   1
overcast  high      true   yes   1
overcast  normal    false  yes   1
overcast  normal    true   yes   1
rainy     high      false  yes   1
rainy     high      true   no    1
rainy     normal    false  yes   2
rainy     normal    true   no    1
sunny     high      false  no    2
sunny     high      true   no    1
sunny     normal    false  yes   1
sunny     normal    true   yes   1

(a) K_base

play  af
no    5
yes   9

(b) K_class

Fig. 6. Example Outcome Preparation
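
The two steps of the preparation phase can be mimicked with ordinary grouping
and counting. The Python sketch below is only an approximation of the algebraic
expressions: it assumes that a stand-in for the weather relation can be rebuilt
by repeating each cell of Fig. 6(a) af times, and it collapses the aggregation
and the removal of dimensions into one helper. The names relation_to_kcube and
aggregate are illustrative.

```python
from collections import Counter

DIMS = ("outlook", "humidity", "windy", "play")

# (outlook, humidity, windy, play) -> af, as tabulated in Fig. 6(a).
FIG_6A = {
    ("overcast", "high", "false", "yes"): 1,
    ("overcast", "high", "true", "yes"): 1,
    ("overcast", "normal", "false", "yes"): 1,
    ("overcast", "normal", "true", "yes"): 1,
    ("rainy", "high", "false", "yes"): 1,
    ("rainy", "high", "true", "no"): 1,
    ("rainy", "normal", "false", "yes"): 2,
    ("rainy", "normal", "true", "no"): 1,
    ("sunny", "high", "false", "no"): 2,
    ("sunny", "high", "true", "no"): 1,
    ("sunny", "normal", "false", "yes"): 1,
    ("sunny", "normal", "true", "yes"): 1,
}

def relation_to_kcube(relation, dims):
    """Group tuples by dims and count them (the measure af: Count())."""
    return Counter(tuple(row[d] for d in dims) for row in relation)

def aggregate(kcube, dims, keep):
    """Sum af over cells that agree on the kept dimensions, dropping the rest."""
    idx = [dims.index(d) for d in keep]
    out = Counter()
    for cell, af in kcube.items():
        out[tuple(cell[i] for i in idx)] += af
    return out

# Assumption: rebuild a stand-in for the weather relation from the counts.
weather_r = [dict(zip(DIMS, cell)) for cell, af in FIG_6A.items() for _ in range(af)]

k_base = relation_to_kcube(weather_r, DIMS)   # most detailed K-cube
l_base = [k_base]                             # singleton K-cube-list
k_class = aggregate(k_base, DIMS, ("play",))  # class distribution
l_class = [k_class]
print(dict(k_class))
```

The resulting class distribution agrees with Fig. 6(b): five instances with
play = no and nine with play = yes.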



Phase 2: Iteration. Most of the effort in computing the classifier lies in the
iterative phase of this K-algebraic approach. Algorithm 3 presents the
corresponding K-algebraic expressions, whose output, represented by L_iter,
feeds phase 3.

The iterate operation (denoted by O) characterises this phase. It takes as
input the initial K-cube-list obtained from the preparation phase (L_class), and
also the criteria to decide on when to continue accumulating K-cubes or leave the
iteration. It is assumed that when a K-cube has the annotation gini evaluated
to 0, it shall be accumulated and not specialised further. As an example, the
maximum number of iterations is set to 10. When a pass of the iteration is
completed, the accumulated K-cubes are assigned to L_iter, which is the input to
the combination and deployment phase. Fig. 7 shows the intermediate results in
the 3-pass iteration that leads to the desired result. (K-cube subscripts suggest
which dimensions and subsets of cells led to its derivation.)

Algorithm: K-Algebraic Expressions for Iteration

    // plug in the iterate operation the parameters for the termination condition
    // and the initial list to be processed
1   L_iter ← O_{gini=0, 10}(L_class, λ L_current (
        // generate supersets, of dimensionality one degree higher, of each K-cube
2       L_1 ← μ^{k1,k2}_{v(k1, k2)}(L_current, L_base)
        // evaluate interestingness
3       L_2 ← μ_{interest(k)}(L_1)
        // reduce nested lists to choose best of each
4       L_3 ← Δ_{… k1.gini < k2.gini …}(L_2)
        // partition by dimensions
5       L_4 ← π*_{(play)}(L_3)
        // evaluate interestingness
6       L_5 ← μ_{interest(k)}(L_4)
        // flatten K-cube-list
7       L_6 ← …(L_5)
    ))

Algorithm 3. The Iteration Phase



0th Iteration

    L_current = [ K_class ]
    L_iter    = [ ]

1st Iteration

    L_2       = [ [ K_out, K_hum, K_windy ] ]
    L_3       = [ K_out ]
    L_5       = [ [ K_outra, K_outov, K_outsu ] ]
    L_6       = [ K_outra, K_outov, K_outsu ]
    L_current = [ K_outra, K_outsu ]
    L_iter    = [ K_outov ]

2nd Iteration

    L_2       = [ [ K_outraHum, K_outraWindy ], [ K_outsuHum, K_outsuWindy ] ]
    L_3       = [ K_outraWindy, K_outsuHum ]
    L_5       = [ [ K_outraWindyf, K_outraWindyt ], [ K_outsuHumhi, K_outsuHumno ] ]
    L_6       = [ K_outraWindyf, K_outraWindyt, K_outsuHumhi, K_outsuHumno ]
    L_current = [ ]
    L_iter    = [ K_outov, K_outraWindyf, K_outraWindyt, K_outsuHumhi, K_outsuHumno ]

Fig. 7. Iteration Phase: Intermediate Results



K-cubes represent different ways of partitioning the data. The aim of this 
iterative process is to find those that best discriminate the desired class. The Gini 
index is a measure of how good each K-cube is in this regard, and is associated 
with the K-cube in the form of an annotation. All K-cubes processed in this 
phase have the dimension play, since we are interested in obtaining statistical 
information of its interaction with other dimensions. 
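
How the gini annotation is computed is not spelled out here, so the following
Python sketch makes an explicit assumption: the annotation of a K-cube with
dimensions (d, play) is taken to be the size-weighted Gini impurity of play
within each value of d. The cell counts are those of Fig. 6(a); the function
names are illustrative.

```python
from collections import defaultdict

DIMS = ("outlook", "humidity", "windy", "play")

# Cells of K_base from Fig. 6(a): (outlook, humidity, windy, play) -> af.
K_BASE = {
    ("overcast", "high", "false", "yes"): 1,
    ("overcast", "high", "true", "yes"): 1,
    ("overcast", "normal", "false", "yes"): 1,
    ("overcast", "normal", "true", "yes"): 1,
    ("rainy", "high", "false", "yes"): 1,
    ("rainy", "high", "true", "no"): 1,
    ("rainy", "normal", "false", "yes"): 2,
    ("rainy", "normal", "true", "no"): 1,
    ("sunny", "high", "false", "no"): 2,
    ("sunny", "high", "true", "no"): 1,
    ("sunny", "normal", "false", "yes"): 1,
    ("sunny", "normal", "true", "yes"): 1,
}

def gini(class_counts):
    """Gini impurity 1 - sum(p_c ** 2) of a list of class frequencies."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def weighted_gini(kcube, dim):
    """Size-weighted Gini impurity of play within each value of dim."""
    i, p = DIMS.index(dim), DIMS.index("play")
    groups = defaultdict(lambda: defaultdict(int))
    for cell, af in kcube.items():
        groups[cell[i]][cell[p]] += af
    total = sum(kcube.values())
    return sum(
        sum(g.values()) / total * gini(list(g.values()))
        for g in groups.values()
    )

scores = {d: weighted_gini(K_BASE, d) for d in ("outlook", "humidity", "windy")}
best = min(scores, key=scores.get)
print(best, scores)
```

Under this assumption the specialisation on outlook receives the lowest score
of the three candidates, followed by humidity and then windy.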

Each iterative step starts by obtaining specialisations of the K-cubes being
processed. From the initial K-cube with only the dimension play, immediate
specialisations (with dimensionality one degree higher) are derived, using the
gen-supersets (v) operation (line 2). These are evaluated, and the most promising
one, i.e., the one with lowest gini, is kept, by applying a combination of reduce
(Δ) and select-kcube-on-ann (If) operations. The details of the computation of
the gini index for the current K-cube-list are abstracted away by a K-algebra
equation interesting(K), which is evaluated in lines 3 and 6. Candidate
specialisations are obtained for each K-cube until it has reached the desired
interestingness, at which point it is saved and removed from further
consideration. For example, in the first iteration, L_2 contains the
specialisations [ (outlook, play), (humidity, play), (windy, play) ]. Each
specialisation is evaluated in terms of how discriminative it is with regard to
the values of play. The one with lowest gini, in this case the one with outlook,
is kept and expanded further until the termination criterion is reached. Notice
that, as a K-cube is an aggregated representation of a partition of data, each of
its immediate specialisations corresponds to sub-partitions of the original one,
which matches what a decision tree builder must do in practice.

Once the best candidate is retained, the next step is to explore further the
partitions of data defined by each cell of a K-cube. For example, from the
K-cube with outlook, distinct K-cubes for the subsets of cells corresponding to
outlook:rainy, outlook:overcast and outlook:sunny are obtained, using the
partition-by-dimensions (π*) operation in line 5. These have their
interestingness evaluated in line 6. Subsequently, the functionality of the
iterate operation guarantees that the pure ones are accumulated and the
remaining ones are fed back for further refinement. The iteration ends when the
K-cube-list L_current becomes empty (or the maximum number of passes is reached).
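
The control flow of the iterative phase can be sketched in Python as follows.
This is a simplification under stated assumptions, not the iterate operation
itself: a K-cube is represented only by the dimension values it fixes (a
dictionary), all counts are recovered from the cells of Fig. 6(a), gini is the
size-weighted Gini impurity of play, and the names matching, gini and iterate
are illustrative.

```python
from collections import defaultdict

DIMS = ("outlook", "humidity", "windy", "play")
ATTRS = DIMS[:-1]
K_BASE = {  # cells of Fig. 6(a): (outlook, humidity, windy, play) -> af
    ("overcast", "high", "false", "yes"): 1,
    ("overcast", "high", "true", "yes"): 1,
    ("overcast", "normal", "false", "yes"): 1,
    ("overcast", "normal", "true", "yes"): 1,
    ("rainy", "high", "false", "yes"): 1,
    ("rainy", "high", "true", "no"): 1,
    ("rainy", "normal", "false", "yes"): 2,
    ("rainy", "normal", "true", "no"): 1,
    ("sunny", "high", "false", "no"): 2,
    ("sunny", "high", "true", "no"): 1,
    ("sunny", "normal", "false", "yes"): 1,
    ("sunny", "normal", "true", "yes"): 1,
}

def matching(selection):
    """Cells of K_BASE whose dimension values agree with selection."""
    return {c: af for c, af in K_BASE.items()
            if all(c[DIMS.index(d)] == v for d, v in selection.items())}

def gini(selection, dim=None):
    """Gini of play within selection, size-weighted over values of dim if given."""
    groups = defaultdict(lambda: defaultdict(int))
    for cell, af in matching(selection).items():
        key = cell[DIMS.index(dim)] if dim else None
        groups[key][cell[-1]] += af
    total = sum(sum(g.values()) for g in groups.values())
    score = 0.0
    for g in groups.values():
        size = sum(g.values())
        score += size / total * (1.0 - sum((n / size) ** 2 for n in g.values()))
    return score

def iterate(initial, max_passes=10):
    """Accumulate pure partitions; feed impure ones back for refinement."""
    current, accumulated = list(initial), []
    for _ in range(max_passes):
        if not current:
            break
        fed_back = []
        for sel in current:
            free = [d for d in ATTRS if d not in sel]
            if not free:                      # nothing left to specialise on
                accumulated.append(sel)
                continue
            best = min(free, key=lambda d: gini(sel, d))  # lowest-gini specialisation
            for v in sorted({c[DIMS.index(best)] for c in matching(sel)}):
                part = {**sel, best: v}       # partition by the chosen dimension
                (accumulated if gini(part) == 0.0 else fed_back).append(part)
        current = fed_back
    return accumulated

l_iter = iterate([{}])   # {} plays the role of K_class: no dimension fixed yet
print(l_iter)
```

With the counts of Fig. 6(a) the loop stops after two refining passes and
accumulates five partitions, which correspond to the K-cubes collected in
L_iter in Fig. 7.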



Phase 3: Model Combination and Deployment. The K-cube-list resulting from
phase 2 collects distinct K-cubes, which were judged to be interesting in the
iterative process. They provide aggregate perspectives on (non-overlapping)
partitions of the available data to help determine the class values assigned to
each of the latter. The purpose of this final phase is twofold: (1) to assemble
these collections of K-cubes back into a single K-cube representing the computed
classification model, and (2) to deploy the model in the extensional layer by
means of the appropriate bridge operation, which labels the dataset with the
corresponding class values given by the cells of the K-cube model. The algebraic
specification is given in Algorithm 4.



outlook   humidity  windy  play
sunny     normal    false  yes
sunny     normal    true   yes
sunny     high      false  no
sunny     high      true   no
overcast  normal    false  yes
overcast  normal    true   yes
overcast  high      false  yes
overcast  high      true   yes
rainy     normal    false  yes
rainy     normal    true   no
rainy     high      false  yes
rainy     high      true   no

gini: 0.0000

Fig. 8. K_m1: The Final Model

Algorithm: K-Algebraic Expressions for Combination and Deployment

    // combine K-cubes into a model and obtain class
1   L_7 ← Δ^{k1,k2}_{…}(L_iter)
2   K_m1 ← □(L_7)
3   K_m2 ← …_{(play_m: play)}(K_m1)
4   K_m3 ← ▷_{… Max(rf)}(K_m2)
5   K_m4 ← …_{(play: …)}(K_m3)
    // deploy model
6   R_model ← □(K_m4)
7   R_test ← …_{play}(weather_R)
8   R_predicted ← R_test ⋈ R_model

Algorithm 4. The Combination and Deployment Phase



The first step is to merge the collection in L_iter into a single K-cube using
union-join, which takes as input two K-cubes and merges them into a single
schema-compatible one (i.e., with the same set of dimensions, measures and
annotations). Since the operation is binary and is applied to a list, the
higher-order reduce (Δ) operation is used, and the result is assigned to L_7.
L_7 is then converted into the single K-cube representing the classification
model shown in Fig. 8. Subsequently, the steps in lines 3 and 4 (which correspond
to an induction leap) output the final model (which can be seen to be equivalent
to the tree in Fig. 2). The dimensional values of play are replicated as measures
in line 3 using the copy-dim-as-mea operation. Then, determining which class
value is representative per partition is handled by the generalise (▷) operation
in line 4, which removes the dimension play. As the K-cube shrinks in terms of
its dimensionality, each partition of cells (determined by the remaining
dimensions) is generalised (i.e., a representative cell is chosen) according to
the maximum value of rf, as shown by the input of the operation (Max(rf)). In the
deployment stage, firstly the model is converted into a relation using the
kcube-to-relation (□) bridge operation. Then, in the extensional layer, the model
can be applied to any unlabelled dataset (e.g., R_test) using the relational
natural-join to yield the predictions in R_predicted (weather_R is a relation
representing the weather dataset introduced earlier).
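
The generalise-then-join idea of the deployment stage can be illustrated with a
short Python sketch. It rests on several assumptions: the merged model is
represented by the cells of Fig. 6(a), the absolute frequency af stands in for
the measure rf, and the helpers generalise, kcube_to_relation and natural_join
are hypothetical simplifications of the corresponding operations. The test
tuples are made up for the example.

```python
DIMS = ("outlook", "humidity", "windy", "play")
K_M1 = {  # stand-in for the merged model: (outlook, humidity, windy, play) -> measure
    ("overcast", "high", "false", "yes"): 1,
    ("overcast", "high", "true", "yes"): 1,
    ("overcast", "normal", "false", "yes"): 1,
    ("overcast", "normal", "true", "yes"): 1,
    ("rainy", "high", "false", "yes"): 1,
    ("rainy", "high", "true", "no"): 1,
    ("rainy", "normal", "false", "yes"): 2,
    ("rainy", "normal", "true", "no"): 1,
    ("sunny", "high", "false", "no"): 2,
    ("sunny", "high", "true", "no"): 1,
    ("sunny", "normal", "false", "yes"): 1,
    ("sunny", "normal", "true", "yes"): 1,
}

def generalise(kcube, dims, drop):
    """Remove dimension drop; per remaining partition keep the value of drop
    carried by the cell with the largest measure."""
    i = dims.index(drop)
    best = {}
    for cell, measure in kcube.items():
        key = cell[:i] + cell[i + 1:]
        if key not in best or measure > best[key][1]:
            best[key] = (cell[i], measure)
    return {key: label for key, (label, _) in best.items()}

def kcube_to_relation(kcube, dims, label_name):
    """Turn the generalised K-cube into a list of tuples (dicts)."""
    return [dict(zip(dims, key), **{label_name: label}) for key, label in kcube.items()]

def natural_join(left, right):
    """Join two relations (lists of dicts) on all attribute names they share."""
    out = []
    for a in left:
        for b in right:
            if all(a[k] == b[k] for k in set(a) & set(b)):
                out.append({**a, **b})
    return out

k_model = generalise(K_M1, DIMS, "play")
r_model = kcube_to_relation(k_model, DIMS[:-1], "play")
r_test = [  # hypothetical unlabelled tuples
    {"outlook": "sunny", "humidity": "high", "windy": "false"},
    {"outlook": "rainy", "humidity": "normal", "windy": "true"},
    {"outlook": "overcast", "humidity": "high", "windy": "true"},
]
r_predicted = natural_join(r_test, r_model)
print([row["play"] for row in r_predicted])
```

Because every test tuple matches exactly one model tuple on the shared
attributes, the join simply appends the predicted class to each test tuple.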



4 Related Work 

There have been proposals for KDD algebras, e.g., the 3W algebra [12] and the
proposal in [6]. Both are based on constraint database systems. They manipulate
data and models together using a single carrier, but data mining is seen
as a black box, with a single operation as a wrapper for a specific algorithm.
Some of the work [13,10] on integrating knowledge discovery and database
research has focused on a narrow range of tasks, typically association rules, or
else confines itself to the proposal of a surface language, which fails to throw
light on what computations can be defined with it, either abstractly or
concretely. Some work [9,2] covers a wider range of tasks, but proceeds in a
tool-set mode rather than being based on general principles and, as such, fails
to provide a formal framework for the integration of the two research areas.
Finally, researchers [17,15] have been attracted to the idea of using
user-defined functions for implementing data mining algorithms that are strongly
coupled to database query engines, but again only algorithms have been attempted
and there is a distinct lack of higher-level abstractions.



5 Conclusions 

This paper has presented two main contributions: (1) the K-algebra and (2)
a specification of decision-tree induction in the K-algebra. The latter is, to the
best of our knowledge, the first logical-algebraic specification of a KDD task that
requires more expressiveness than is available in classical database algebras.
This shows that the K-algebra opens the way for the expression of challenging KDD
tasks in purely algebraic form. In particular, the K-algebra can express the
iterative aspects of the KDD process and, through its layered design, covers a
wider range of tasks in the KDD process than previous work. More importantly, the
K-algebra is a database algebra insofar as it is designed to operate on
collections (and hence, flows). It should be noted that the K-algebra has been
designed to model discovery tasks in the symbolic tradition and, hence, the
question as to whether it can be used to model tasks that are expressed in, say,
connectionist or genetically-inspired approaches (e.g., neural networks, genetic
algorithms) has not yet been investigated. The goal of this proposal is to address
the issues identified, viz., lack of compositionality, lack of generality and
dependency on user expertise. The K-algebra's response to lack of compositionality
is to exhibit closure through its layered design. Its response to lack of
generality is to be founded on generic algorithmic elements (viz., expand,
compress, evaluate, accumulate, filter out and combine). Its response to
dependency on user expertise is to be formulated and formalised at a logical,
abstract level upon which optimisation algorithms can, in future work, be
developed. There also remains the important task of trying to capture more kinds
of discovery approaches and of investigating empirically the practicality of the
K-algebra, especially the development of tools that may be necessary for its
usability.

Abstract. Innovative information systems such as content management
systems and information brokers are designed to organize a complex 
mixture of media content - texts, images, maps, videos, ... - and to 
present it through domain specific conceptual models, for example, on 
sports, stock exchange, or art history. 

In this paper we extend the currently dominating computational container
models into a coherent content-concept model intended to capture more of the
meaning - thereby improving the value - of content. Integrated content-concept
views on entities are modeled using the notion of assets, and the rationale of
our asset language is based on arguments for open language expressiveness [19]
and dynamic system responsiveness [8]. In addition, we discuss our experiences
with a component-based implementation technology which substantially simplifies
the implementation of open and dynamic asset management systems.



1 Introduction: On Content-Concept Integration 

Important classes of innovative information systems such as content management 
systems and information brokers are designed to organize a complex mixture of 
media content - texts, images, maps, videos, ... - and to present it through 
domain specific conceptual models [6], for example, on sports, stock exchange, 
or art history. 

Traditional implementations of such information systems reflect this com- 
plexity through their software intricacy resulting from a heterogeneous mix of 
conventional database technology, various augmentations by text, image or geo 
functionality (or just by blobs) and through additional organizational principles 
from domain ontologies [25,5]. 

In this paper we argue for a homogeneous basis for conceptual content man- 
agement and present 

— an integrated content-concept model - based on so-called assets [21] -, 

— an asset language and its conceptual foundation [19], [8], as well as 

— an implementation technology and architecture. 



A. Benczúr, J. Demetrovics, G. Gottlob (Eds.): ADBIS 2004, LNCS 3255, pp. 99-112, 2004.
© Springer-Verlag Berlin Heidelberg 2004



100 H.-W. Sehring and J.W. Schmidt 



Our work is inventive essentially due to the following three contributions: 

1. Content is always associated with its concept and represented by assets, i.e., 
content-concept-pairs. In a sense, assets generalize the notion of typed val- 
ues or schema-constrained databases. Assets represent application entities, 
concrete or abstract ones; 

2. Asset schemata are open in the sense that users can change asset attributes 
on-the-fly and any time, thus guaranteeing best correspondence with the 
entity-at-hand; 

3. Asset management systems are dynamic, i.e., the system implementa- 
tion changes dynamically following any on-the-fly modification of an asset 
schema; this requirement demands a specific system modularization and an 
innovative system architecture. 

Our paper is structured as follows: after a short introduction of our asset 
language (section 2) we discuss the essentials of open and dynamic asset-based 
modeling and present an extensive example from the domain of art history (sec- 
tion 3). In section 4 we argue the benefits of asset compilation and its advantages 
for software system construction. The overall modularization and architecture 
of asset-based information systems are presented in section 5. We conclude with 
a short summary and an outlook into further applications of asset-based tech- 
nology. 



2 An Asset Language for Integrated Information 
Management 



Assets represent application entities by content-concept-pairs (fig. 1). Following 
observations of [8] and others, neither content nor a conceptual model of an 
entity can exist in isolation. The conceptual part is needed to explain the way 
content refers to an entity. Content serves as an existential proof of the validity 
of concepts. 




Fig. 1. Assets represent entities by [content | concept]-pairs 

The content facet of assets is managed through object-oriented multimedia
container technology. Assets contain handle objects referring to pieces of content. 

To support expressive entity representations the concept facet is given by 
three contributions: 

1. characteristic values, 

2. relationships between assets, and 

3. rules (types, constraints, . . . ). 

Characteristic values describe entities by their immanent properties. Though
the values may change, one value is always assigned. Relationships between assets
describe entities by their relation to others. Such relations may be changed or - in 
contrast to characteristic values - even be removed. Regular assertions describe 
facts about a set of similar assets. Type and value constraints on characteristics 
and relationships fall in this category. 

Thus, the notion of assets follows closely the theoretical work of [19] (firstness, 
secondness, thirdness) and [8] (indivisibility of content and concept). 

As already argued in the introduction, a conceptual content management 
system has to be based on a responsive dynamic model to adequately represent 
changing entities [21]. For expressive entity description and responsive domain 
modeling we propose an asset language to be employed to notate individual 
asset definitions and their systematics. Our asset language syntax corresponds 
to modern class-based languages [1]. 

An alternative to such a linguistic approach would be the use of a generic 
model to which the intended domain model is mapped. In this case users would 
have to face all the problems implied by an on-the-fly translation between their 
intended entity model and the generic model (“mentally compile and deci- 
pher” [10]). 

As an asset definition language the language is used to define asset classes. 
The following code gives an example of an asset class definition: 

class Equestrian {
  content reproduction : java.awt.Image ...
  concept
    characteristic yearOfCreation : java.util.Calendar
    characteristic medium : de.tuhh.sts.wel.Media
    relationship painter : Artist
    relationship epoch : Epoch
    constraint epoch = painter.epoch
}

The body of a class definition contains two sections. Under content refer- 
ences to pieces of content are defined by content handles. Their type is given 
in the class definition. Valid types are defined by some object-oriented language 
underlying the asset language. Currently we use Java for this purpose. 

The conceptual part of entity descriptions is formalized in the concept
section. Here the contributions discussed above can be found: characteristics,
relationships, and rules.

Definitions starting with characteristic and relationship define attributes
which can be set for any instance of a defined class. Each of these
is identified by a name. A type for actual values or bindings is given after the
colon. In the case of characteristics this is a Java class again. In the example
shown above the standard Java class Calendar and the application-specific class
Media are used. For relationships an asset class is given which constrains the
type of assets referred to. If the class name is followed by an asterisk (*) it is
a many-to-many relationship binding a set of asset instances.

Constraints pose value restrictions on the attributes of all instances of one 
asset class. In the above example it is defined that the epoch of the equestrian 
artwork has to be the same as the epoch of the artist (if one is bound). In 
constraint statements all attributes of the current class can be used plus the 
attributes of bound assets. 

In the example the epoch binding for the current Equestrian instance is 
compared to the corresponding binding of the related Artist instance. For this 
to work the asset class Artist needs to define an epoch relationship to Epoch 
assets, just like Equestrian. 
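
The intended effect of the constraint can be pictured with a small Python
sketch. This is not the asset language or its implementation; it merely assumes
that asset classes can be mimicked by dataclasses, that an unbound relationship
is represented by None, and that equality of bound assets is plain value
equality. The method name satisfies_constraints is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Epoch:
    name: str

@dataclass
class Artist:
    name: str
    epoch: Optional[Epoch] = None      # relationship epoch : Epoch

@dataclass
class Equestrian:
    medium: str                        # characteristic medium
    painter: Optional[Artist] = None   # relationship painter : Artist
    epoch: Optional[Epoch] = None      # relationship epoch : Epoch

    def satisfies_constraints(self) -> bool:
        """constraint epoch = painter.epoch, checked only if a painter is bound."""
        if self.painter is None:
            return True
        return self.epoch == self.painter.epoch

baroque = Epoch("Baroque")
rubens = Artist("Rubens", baroque)
consistent = Equestrian("STATUE", painter=rubens, epoch=baroque)
inconsistent = Equestrian("STATUE", painter=rubens, epoch=Epoch("Middle Ages"))
print(consistent.satisfies_constraints(), inconsistent.satisfies_constraints())
```

The check reads the epoch binding of the related Artist instance, which is why
Artist has to offer an epoch relationship of its own.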

Possible comparators in constraint expressions test for equality (“=”), lesser 
(“<”), greater (“>”), different (“#”), or similar (“^”) values or bindings. How 
the comparator is actually evaluated depends on the compared attributes’ types. 
For characteristics, evaluation is done according to a Java Comparator. For rela- 
tionships the comparisons are mapped to set relations (equality, inclusion, . . . ). 
In both cases, similarity is decided in an implementation-dependent manner (see 
below). Expressions can be combined using the logical operators and, or, and 
not. 

Classes can be defined as subclasses of existing ones using the refines key- 
word: 

class MedievalEquestrian refines Equestrian {
  concept constraint epoch = middleAges
}

This way definitions are inherited by the subclass. Here, a constraint on epoch 
is added. Inherited definitions can be overridden. 

Another way to define asset classes is by giving an extensional definition. 
This is done by naming a set of asset instances which define a class. There are 
two variants of the extensional class definition. The first one gives a fixed set of 
instances which are the complete extent of an asset class: 

class Equestrian definedby { e1, e2, e3 }

This way the asset class Equestrian is introduced as an enumeration type
with possible instances e1, e2, and e3.

For the second variant the set of asset instances serves as an example for the
intended extent of the new asset class. A definition like

class Equestrian definedby ^ { e1, e2, e3 }

defines Equestrian to be the class of all instances similar to e1, e2, and e3.
To decide upon similarity, conceptual content management systems incorporate
retrieval technology [3]. The set of asset instances is used as a training set.

For the management of asset instances statements of the asset language serve 
as an asset query and manipulation language. A create operation is used for 
the creation of asset instances. The following is an example for the instantiation 
of an Equestrian instance: 

create Equestrian {
  medium := de.tuhh.sts.wel.Media.STATUE
  painter := rubens
}

Here, STATUE is a constant class variable of Media holding a Java object for the 
media type “Statue” (singleton [12]). rubens is the name of an Artist asset 
instance. 

A variant of the create operation allows naming a prototype instance instead
of the set of initial bindings: create Equestrian eqProto.

For updates the modify operation is used. For example, the update of an
Equestrian instance named eq1 is done by:

modify eq1 {
  medium := de.tuhh.sts.wel.Media.STATUE
  painter := rubens
}

A variant similar to that of create allows the naming of a prototype instance
instead of the set of new value and instance bindings: modify eq1 eqProto.

The lookfor operation is used to retrieve asset instances. It searches for all 
instances of a given class. As query parameters all characteristic and relationship 
attributes can be constrained. An example query for Equestrian instances which 
are statues by Rubens is: 

lookfor Equestrian {
  medium = de.tuhh.sts.wel.Media.STATUE
  painter = rubens
}
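
Read operationally, such a query filters the instances of a class by equality
constraints on their attributes. The Python sketch below illustrates only this
reading; it assumes asset instances are plain dictionaries with a "class"
entry, ignores subclasses and the other comparators, and uses a hypothetical
function name and made-up instances.

```python
def lookfor(instances, asset_class, **constraints):
    """Return the instances of asset_class whose attributes equal all constraints."""
    return [
        inst for inst in instances
        if inst["class"] == asset_class
        and all(inst.get(attr) == value for attr, value in constraints.items())
    ]

STATUE, PAINTING = "STATUE", "PAINTING"   # stand-ins for constants of Media

assets = [  # hypothetical asset instances
    {"class": "Equestrian", "name": "eq1", "medium": STATUE, "painter": "rubens"},
    {"class": "Equestrian", "name": "eq2", "medium": PAINTING, "painter": "rubens"},
    {"class": "Artist", "name": "rubens"},
]

hits = lookfor(assets, "Equestrian", medium=STATUE, painter="rubens")
print([h["name"] for h in hits])
```

Only the instance that satisfies both constraints is returned; leaving out a
constraint widens the result accordingly.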

Due to space limitations not all aspects of the asset language can be explained 
in this section. The detailed definition of the asset language can be found in [23]. 

3 Asset-Based Modeling: A Case for Open and Dynamic 
Information Systems 

Asset definitions usually depend on considerations like the state of the entities 
to describe, the users’ expertise, their current task, etc. For various reasons such 
influencing factors may change over time (see [8]): 

title       “Bonaparte Crossing the Alps...”
artist      Jacques-Louis David : Painter
regent      Napoleon I. : Emperor
motives     mountain, alps, horse, hand
text        Bonaparte, Hannibal, Carolus
reference   Carolus Magnus Crossing the Alps, Hannibal Crossing the Alps

(a) Media Content for the Concept “Equestrian Statue”
(b) Conceptual Model of one “Equestrian Statue”

Fig. 2. Asset Facets



— The observed entities change. Thus, their descriptions have to be adjusted. 
This is true even for class definitions because in different states an entity 
is described by different sets of contents, characteristics, relationships, and 
constraints. 

— The users’ expertise influences their information needs. Usually users are 
not willing to (explicitly) provide data they do not consider interesting. 
For communication with others, though, assets need to be tailored to the 
receiver’s needs. 

— A user can view an entity while being in different contexts, e.g., depending 
on the task for which an asset is needed. Different asset definitions may be 
needed when changing context. 

Thus, openness and dynamics as defined in the introduction are important 
properties of conceptual content management systems for a variety of reasons. 

In application projects we observed that knowledge about application entities 
is captured by modeling the processes in which they have been created and used. 
Soft-goals like reasons, intentions, etc. of the creation of entities are recorded in 
such applications (see also [29]). 

In the project Warburg Electronic Library (WEL) a prototypical open dynamic 
conceptual content management system has been developed [22]. In application 
projects our project partners create large numbers of assets modeling their 
domain. One primary application is art history [7]. Our project partners from 
art history use the WEL to pursue research in the field of Political Iconography. 
For content they collect reproductions of artworks (see fig. 2(a)). 

The conceptual part of assets records the historical events which prove that 
the artifact under consideration has been used to achieve political goals. Typical 
information is the creation date and location of a piece of art, relationships to 
the regent ordering it and the artist who created it, relationships to other works 
which are influential, information on the way it has been presented, etc. (see 
fig. 2(b)). Sets of asset instances define categories which name political 
phenomena. Additional constraints reveal how products of art work for political 
reasons. 

Fig. 3. Open Dynamic Asset Use in the WEL for E-Learning 

From a computer science perspective the WEL is an important project for 
understanding the nature of conceptual content management systems. It serves 
as a field study with students and scientists. The WEL has been online for several 
years now [27]. Its services are used by some hundred scientists worldwide, mainly 
from the humanities. In cooperation with our project partners it has been used 
as an e-learning tool in several seminars during the past years. 

As a research tool the WEL maintains a set of assets available to a research 
community. Researchers employ openness and dynamics to model their hypotheses. 
They can do so without interfering with others and without exposing their 
results to them. When there are valuable findings a community can choose to 
integrate the assets of one of its members. For this, the WEL maintains open asset 
models and the corresponding asset instances on a per-user basis. Within scopes 
controlled through group membership asset instances can be shared among users. 

In e-learning scenarios assets are prepared as course material. The body of 
asset instances is maintained for teaching purposes or, as is the case with the 
WEL, research efforts being carried out. Openness is needed by both teachers 
and students [ML95]. Teaching staff can select and adapt assets as teaching 
material for a course to be supported. In seminars and lab classes students can 
modify this material. Such a process is illustrated in fig. 3. This way, students 
get hands-on experience with the definition of concepts, the validation of models, 
the creation of content, etc. 

4 Information System Construction by Asset Compilation 

Domain experts formulate asset models using the asset language introduced in 
section 2. As discussed in the previous section there is a demand for open systems 
allowing the modification of existing assets. For asset management systems to 
meet the openness requirement these are automatically created based on such 
domain models. As part of the asset technology this is done by an asset compiler. 

The compilation process bears a resemblance to modern software-engineering 
approaches like Model Driven Architecture (MDA) [18]. The asset compiler cre- 
ates a platform independent model from a domain model. The platform indepen- 
dent model is then translated into a running software system. The translation of 
domain models is described in this section. The following section concentrates 
on actual implementations. 

As a first step in the compilation process a data model is created from asset 
definitions given in the asset language (comparable, for example, to the interface 
modules of [14]). The current asset compiler uses Java as its target language. The 
data model consists of Java interfaces. Additional parameterizations of standard 
components are created as needed (see next section). Examples are schemata for 
databases or content management systems, XML schemata, etc. 

For each asset class a Java interface is created. Definitions of subclasses are 
mapped to subtypes. The interfaces adhere to the JavaBeans standard [13]. Ac- 
cess methods ( “getter” and “setter” ) are defined for characteristics and relation- 
ships. Class-level (thirdness) contributions are implemented in the operations of 
classes generated according to the interface definitions: constraints are mapped 
to constrained properties which throw a VetoException when the constraint is 
violated (see also [15]). Rules are expressed by bound properties which cause the 
invocation of further methods under the condition set by a constraint. 
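
The mapping of constraints and rules to properties can be sketched as follows. The sketch is in Python for brevity, whereas the compiler described here generates Java. The names VetoError and EquestrianBean and the example constraint on medium are assumptions made for illustration; they are not taken from the generated code.

```python
class VetoError(Exception):
    """Raised when a proposed value violates a constraint (the veto)."""


class EquestrianBean:
    """Hypothetical generated class with one constrained and bound property."""

    def __init__(self):
        self._medium = None
        self._listeners = []  # callbacks invoked after a successful change

    def add_change_listener(self, callback):
        self._listeners.append(callback)

    def get_medium(self):
        return self._medium

    def set_medium(self, value):
        # Constrained property: check the constraint before changing state.
        if value not in ("STATUE", "PAINTING"):
            raise VetoError(f"medium {value!r} violates the constraint")
        old, self._medium = self._medium, value
        # Bound property: notify listeners, which may invoke further methods.
        for callback in self._listeners:
            callback("medium", old, value)


events = []
bean = EquestrianBean()
bean.add_change_listener(lambda name, old, new: events.append((name, old, new)))
bean.set_medium("STATUE")      # accepted, listeners are notified
try:
    bean.set_medium("SONG")    # rejected by the constraint
    vetoed = False
except VetoError:
    vetoed = True
```

The rejected assignment leaves the property unchanged and triggers no listener, which is the behavior expected of a constrained property.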

The UML class diagram in fig. 4 gives an overview of the generated code. The 
packages lifecyclemodel and implementation contain generic interfaces and 
classes which are part of the runtime environment of a conceptual content 
management system. A package like some.project is generated by the asset compiler. 

The interfaces from package lifecyclemodel reflect possible states in the 
life of an asset instance. They define methods which allow state transitions as 
shown in the state chart in fig. 5. 

Interfaces reflecting an asset model are created as subtypes of those generic 
ones. In fig. 4 the interfaces shown in package some.project are created for a 
defined asset class A. These introduce methods which reflect the asset class’s 
characteristics, relationships, and constraints as explained above. 

Not shown in fig. 4 are additional interfaces which describe the management 
of asset instances: 

— Class objects carry the asset class definitions into the data model (meta 
level). They offer reflection comparable to object-oriented programming lan- 
guages. 

— Instances are created following the factory method pattern [12]. 
Fig. 5. State Diagram for the Asset Instance Life Cycle 



— Query interfaces define possible queries to retrieve asset instances. They 
are equipped with methods to formulate query constraints. These constrain 
methods are generated for each defined characteristic and relationship and 
each comparison operator. 

— For collections of asset instances iterators [12] are defined for each asset class. 

The generated interfaces reflect the domain model. The abstract classes from 
the implementation package shown on the right of figure 4 introduce platform 
independent functionality. Classes which implement the interfaces and make 
use of the abstract classes are generated by the compiler. For an example see 
Oracle9iTImpl in package clientmodule in the class diagram (a subpackage 
of the application-specific package some.project). Sets of classes form modules 
which make up a conceptual content management system. These are described 
in the subsequent section. 



5 Modular System Architectures for Asset Management 

Open modeling allows users to adjust domain models at any time. This may affect 
the model of one user who wishes to change asset definitions, or the models of 
a user group and one of its users, who creates a personal variant of the group’s 
model. 

To dynamically adapt conceptual content management systems to chang- 
ing models they are recompiled at runtime. The demand for dynamics leads to 
system evolution [17]. 

The evolution of conceptual content management systems has two aspects: 

— the software needs to be modified, and 

— existing asset instances need to be maintained. 

Typical issues with respect to these two aspects of evolution are: 

— Changes performed on behalf of individual users should not have any impact 
on others. Therefore, dynamic support for system evolution must not prevent 
continuous operation of the software system. 

— On the one hand, assets as representations of domain entities cannot au- 
tomatically be converted in general. On the other hand, manual instance 
conversion is not feasible for typical amounts of asset instances. 

— If a user personalizes assets for his own needs, he still will be interested in 
changes applied to the original. Through awareness [11] measures he can be 
informed about such changes. To be able to review the changes, access to 
both the former and the current versions is needed. That is, revisions of 
assets and their schemata need to be maintained. 

Crucial for both aspects of evolution, software as well as asset instances, 
is a modularized system architecture. At each evolution step, distinct system 
modules maintain the sets of asset instances created under different schemata. 
They are produced by the asset compiler and dynamically added or replaced. 

For our asset technology we identified a small set of module kinds of a concep- 
tual content management system. The conceptual content management system 
architecture supports the dynamic combination of instances of the various 
module kinds. These modules share some similarities with components [26,2] 
(combinability, statelessness, ...), but in contrast to these they are generated for a 
concrete software system. Modules constitute the minimal compilation units of 
the generated software which the compiler can add or replace. 

Figure 6 shows an example of the evolution of a user’s domain model. Assume 
that the conceptual content management system shown in fig. 6(a) is in 
operation. It simply consists of one module m as the application layer and a 
database as the data layer of a layered architecture. Both have been generated 
according to a domain model M. 

Fig. 6. Asset Management System Evolution Step: (a) Initial Asset Management 
System, (b) Added Modules after one Evolution Step 

If model M is redefined to become model M' the system is recompiled. This 
leads to the generation of additional modules which are incorporated for dynamic 
system evolution. The result is shown in fig. 6(b). First note that the original 
conceptual content management system is maintained as a subsystem of the new 
system version. This way existing asset instances are kept intact. 

In this example a second database is set up to store asset instances following 
model M'. A module m' for accessing the database is created similar to m found 
in the original conceptual content management system. Two further modules are 
added to combine the two subsystems for models M and M'. These follow the 
mediator architecture [28], an important building block of the conceptual content 
management system architecture. The mapping module m_a serves as a wrapper 
lifting assets of M to M' (compare [20]). Mapping issues are covered by a module 
kind of its own rather than integrating it into the other kinds of modules so that 
mappings can be plugged dynamically [16]. The mediation module m_m reflects 
M' in the application layer of the new system version. It routes requests to either 
m_a or m'. Lookups are forwarded to both these modules and the results are 
unified. New asset instances are always created in m' according to M'. Update 
requests of instances of M lead to their deletion in the subsystem for M and 
their recreation in m'. 
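
The routing policy described in this paragraph can be sketched in a few lines. This is a minimal illustration in Python under stated assumptions, not the generated module code: the class names ClientModule, MappingModule and MediationModule, the dictionary representation of asset instances, and the lift function are all hypothetical.

```python
class ClientModule:
    """Hypothetical module storing instances of one schema, keyed by identifier."""

    def __init__(self):
        self.instances = {}

    def lookup(self, predicate):
        return {k: v for k, v in self.instances.items() if predicate(v)}

    def create(self, key, instance):
        self.instances[key] = instance

    def delete(self, key):
        self.instances.pop(key, None)


class MappingModule:
    """Wrapper that presents instances of M as instances of M'."""

    def __init__(self, old_module, lift):
        self.old_module = old_module
        self.lift = lift  # assumed function converting an M instance to M'

    def lookup(self, predicate):
        lifted = {k: self.lift(v) for k, v in self.old_module.instances.items()}
        return {k: v for k, v in lifted.items() if predicate(v)}

    def delete(self, key):
        self.old_module.delete(key)


class MediationModule:
    """Routes requests to the mapping module or to the module for M'."""

    def __init__(self, mapping, new_module):
        self.mapping = mapping
        self.new_module = new_module

    def lookup(self, predicate):
        # Lookups go to both subsystems; the results are unified.
        result = self.mapping.lookup(predicate)
        result.update(self.new_module.lookup(predicate))
        return result

    def create(self, key, instance):
        # New instances are always created under M'.
        self.new_module.create(key, instance)

    def update(self, key, instance):
        # An updated instance is deleted from the old subsystem and recreated under M'.
        self.mapping.delete(key)
        self.new_module.create(key, instance)


old_module = ClientModule()
old_module.create("eq1", {"painter": "rubens"})
new_module = ClientModule()
mapping_module = MappingModule(old_module, lift=lambda old: {**old, "medium": None})
mediation_module = MediationModule(mapping_module, new_module)

mediation_module.create("eq2", {"painter": "david", "medium": "PAINTING"})
before = mediation_module.lookup(lambda a: True)
mediation_module.update("eq1", {"painter": "rubens", "medium": "STATUE"})
after = mediation_module.lookup(lambda a: True)
```

In this sketch an instance migrates to the new subsystem only when it is updated, so the old subsystem empties incrementally while lookups continue to see both.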

The two issues with evolution mentioned above are taken into consideration 
by this approach. Preserving the existing software as part of a new system 
version leaves it in continuous operation. With the mentioned update policy asset 
instances are incrementally converted when needed. This way, users can perform 
the task of reviewing the asset instances one by one. More sophisticated policies 
might switch the mediation module to batch mode when only a certain number of 
instances is left in the outdated schema. 

In a similar way configurations for other functionality are set up. E.g., to 
store revisions of asset instances these are maintained by distinct subsystems. A 
mediation module takes care of the revision control. 

The above example introduces the three most important module kinds: client 
modules to access standard components managing the assets’ content and data, 
mapping modules to adjust schemata, and mediation modules to glue the mod- 
ules of a conceptual content management system together. These form the core 
of any conceptual content management system. In addition to these, figure 7 
shows the remaining two module types. Distribution modules allow the incor- 
poration of modules residing on different networked computers. Fig. 4 indicates 
two possible implementations in package distribution: the HTTP-based trans- 
mission of XML documents with a schema generated from the asset definitions 
(comparable to the suggestion of [24]) and one for remote method calls using 
SOAP. Server modules (not shown in fig. 4) offer the services of a conceptual 
content management system following a standard protocol for use by third party 
systems. One example is a server module for Web Services [4] with generated 
descriptions given in WSDL [9]. 



6 Concluding Remarks 

Our asset model abstracts and generalizes an essential part of the core experi- 
ence in database design and information system development. Initial applications 
demonstrate that asset-based modeling simplifies information system projects 
and increases the reusability of system functionality. 

The degree of open schema and dynamic system changeability is substan- 
tially improved by a better understanding of the appropriate architecture and 
modularization of conceptual content management systems. 

In addition we expect that asset-based modeling will greatly improve typical 
standard tasks in information systems administration. The very same method- 
ology used for domain-specific entity modeling may also be applied to software 
entities and, therefore, to information systems themselves. Typical examples 
include naming and messaging services, user and rights management or visu- 
alization tasks. User interfaces, for example, will benefit significantly from an 
asset-based GUI model and UI description. A presentation logic which associates 
assets from the application domain and the GUI realm can then be used by a 
GUI engine to render such UI descriptions and exploit the dynamic openness of 
asset management for user interface adaptation. 



References 

1. Martin Abadi and Luca Cardelli. A Theory of Objects. Monographs in Computer 
Science. Springer-Verlag New York, Inc., 1996. 

2. U. Aßmann. Invasive Software Composition. Springer-Verlag, 2003. 

3. Thomas Büchner. Entwurf und Realisierung eines Java-Frameworks zur 
inhaltlichen Erschließung von Informationsobjekten. Master’s thesis, Software 
Systems Department, Technical University Hamburg-Harburg, Germany, 2002. 

4. David Booth, Hugo Haas, Francis McCabe, Eric Newcomer, Michael Champion, 
Chris Ferris, and David Orchard. Web Services Architecture, W3C Working Group 
Note. http://www.w3.org/TR/ws-arch/, 11 February 2004. 

5. Alex Borgida and Ronald J. Brachman. Conceptual Modeling with Description 
Logics. In Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, 
and Peter Patel-Schneider, editors, The Description Logic Handbook: Theory, Im- 
plementation and Applications, pages 349-372. Cambridge University Press, 2003. 

6. Michael L. Brodie, John Mylopoulos, and Joachim W. Schmidt, editors. On 
Conceptual Modelling: Perspectives from Artificial Intelligence, Databases, and 
Programming Languages. Topics in Information Systems. Springer-Verlag, 1984. 

7. Matthias Bruhn. The Warburg Electronic Library in Hamburg: A Digital Index of 
Political Iconography. Visual Resources, XV:405-423, 2000. 

8. Ernst Cassirer. Die Sprache, Das mythische Denken, Phänomenologie der 
Erkenntnis, volume 11-13 Philosophie der symbolischen Formen of Gesammelte Werke. 
Felix Meiner Verlag GmbH, Hamburger Ausgabe edition, 2001-2002. 

9. Roberto Chinnici, Martin Gudgin, Jean-Jacques Moreau, Jeffrey Schlimmer, and 
Sanjiva Weerawarana. Web Services Description Language (WSDL) Version 2.0 
Part 1: Core Language. www.w3.org/TR/wsdl20/, March 2004. 

10. Dov Dori. The Visual Semantic Web: Unifying Human and Machine Knowledge 
Representations with Object-Process Methodology. In Isabel F. Cruz, Vipul 
Kashyap, Stefan Decker, and Rainer Eckstein, editors, Proceedings of SWDB’03, 
The first International Workshop on Semantic Web and Databases, Co-located with 
VLDB 2003, Humboldt-Universität, Berlin, Germany, 7.-8. September 2003. 

11. P. Dourish and V. Bellotti. Awareness and Coordination in Shared Workspaces. In 
Proceedings of ACM CSCW 92 Conference on Computer-Supported Work, pages 
107-114, 1992. 



112 H.-W. Sehring and J.W. Schmidt 



12. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: 
Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994. 

13. Graham Hamilton. JavaBeans (Version 1.01-A). Sun Microsystems, Inc., 8. Au- 
gust 1997. 

14. Manfred A. Jeusfeld. Generating Queries from Complex Type Definitions. In 
Franz Baader, Martin Buchheit, Manfred A. Jeusfeld, and Werner Nutt, editors. 
Reasoning about Structured Objects: Knowledge Representation Meets Databases, 
Proceedings of 1st Workshop KRDB’94, CEUR Workshop Proceedings, 1994. 

15. H. Knublauch, M. Sedlmayr, and T. Rose. Design Patterns for the Implementation 
of Constraints on JavaBeans. In Tagungsband Net.ObjectDays 2000, Erfurt, 9.-12. 
Oktober. tranSIT GmbH, 2000. 

16. Mira Mezini, Linda Seiter, and Karl Lieberherr. Component integration with plug- 
gable composite adapters. In Software Architectures and Component Technology. 
Kluwer, 2000. 

17. Giorgio De Michelis, Eric Dubois, Matthias Jarke, Florian Matthes, John My- 
lopoulos, Joachim W. Schmidt, Carson Woo, and Eric Yu. A Three-Faceted View 
of Information Systems. Communications of the ACM, 41(12):64-70, 1998. 

18. Joaquin Miller and Jishnu Mukerji. MDA Guide Version 1.0.1. Technical Report 
omg/2003-06-01, OMG, 12th June 2003. 

19. C.S. Peirce. Collected Papers of Charles Sanders Peirce. Harvard University Press, 
Cambridge, 1931. 

20. Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic 
schema matching. VLDB Journal, 10(4):334-350, 2001. 

21. Joachim W. Schmidt and Hans-Werner Sehring. Conceptual Content Modeling 
and Management: The Rationale of an Asset Language. In Manfred Broy and 
Alexandre V. Zamulin, editors. Perspectives of System Informatics, 5th Interna- 
tional Andrei Ershov Memorial Conference, PSI 2003, volume 2890 of Lecture 
Notes in Computer Science, pages 469-493. Springer, July 2003. 

22. J.W. Schmidt, H.-W. Sehring, M. Skusa, and A. Wienberg. Subject-Oriented Work: 
Lessons Learned from an Interdisciplinary Content Management Project. In Alber- 
tas Caplinskas and Johann Eder, editors, Advances in Databases and Information 
Systems, 5th East European Conference, ADBIS 2001, volume 2151 of Lecture 
Notes in Computer Science, pages 3-26. Springer, September 2001. 

23. Hans-Werner Sehring. Report on an Asset Definition, Query, and Manipulation 
Language. Version 1.0. Technical report. Software Systems Department, Technical 
University Hamburg-Harburg, Germany, 2003. 

24. German Shegalov, Michael Gillmann, and Gerhard Weikum. XML-enabled work- 
flow management for e-services across heterogeneous platforms. VLDB Journal, 
10(1):91-103, 2001. 

25. John F. Sowa. Knowledge Representation, Logical, Philosophical, and Computa- 
tional Foundations. Brooks/Cole, Thomson Learning, 2000. 

26. C. Szyperski. Component Software: Beyond Object-Oriented Programming. 
Addison-Wesley, 1998. 

27. Homepage of the Warburg Electronic Library, http://www.welib.de, 2004. 

28. G. Wiederhold. Mediators in the Architecture of Future Information Systems. 
IEEE Computer, 25:38-49, 1992. 

29. E. Yu. Modelling Strategic Relationships for Process Reengineering. PhD thesis. 
University of Toronto, 1995. 



Component-Based Modeling of Huge Databases 



Peggy Schmidt and Bernhard Thalheim 



Computer Science and Applied Mathematics Institute, Kiel University, 
Olshausenstrasse 40, 24098 Kiel, Germany 
thalheim@is.informatik.uni-kiel.de, contact@peggy-schmidt.de 



Abstract. Database modeling is still a job of an artisan. Due to this 
approach database schemata evolve by growth without any evolution 
plan. Finally, they cannot be examined, surveyed, consistently extended 
or analyzed. Querying and maintenance become very difficult. Distri- 
bution of database fragments becomes a performance bottleneck. Cur- 
rently, databases evolve to huge databases. Their development must be 
performed with the highest care. 

This paper aims at developing an approach to systematic schema composition 
based on components. The approach is based on the internal skeletal 
meta-structures inside the schema. We develop a theory of database 
components which can be composed to schemata following an architecture 
skeleton of the entire database. 



1 Introduction 

Observations 

Database modeling is usually carried out as a handicraft. Database developers 
know a number of methods and apply them with a craftsman’s skill. Monographs 
and database course books usually base explanations on small or ‘toy’ 
examples. Therefore database modeling courses are not the right school for peo- 
ple handling database applications in practice. Database applications tend to 
be large, carry hundreds of tables and have a very sophisticated and complex 
integrity support. 

Database schemata tend to be large, unsurveyable, incomprehensible and 
partially inconsistent due to the application, the database development life 
cycle and the number of team members involved at different time intervals. 
Thus, consistent management of the database schema might become a nightmare 
and may lead to legacy problems. The size of the schemata may be very large, 
e.g., the SAP R/3 schema consists of more than 21,000 tables. In 
contrast, [Moo01] discovered that diagrams quickly become unreadable once the 
number of entity and relationship types exceeds about twenty. 

It is a common observation that large database schemata are error-prone, 
difficult to maintain and to extend, and not surveyable. Moreover, development 
of retrieval and operation facilities requires the highest professional skills in 
abstraction, memorization and programming. Such schemata reach sizes of more than 
1000 attribute, entity and relationship types. Since they are not comprehensible, 
any change to the schema is performed by extending the schema and thus 
making it even more complex. 



Possible Approaches to Cope with Complexity 

Large database schemata can be drastically simplified if techniques of modular 
modeling such as design by units [Tha00a] are used. Modular modeling is an 
abstraction technique based on principles of hiding and encapsulation. Design by 
units allows parts of the schema to be considered separately. The parts are 
connected via types which function similarly to bridges. [FeT02] already observed 
that each large schema can be separated into a number of application dimensions 
such as the specialization dimension, the association dimension, the log, usage 
and meta-characterization dimension, and the data quality, lifespan and history 
dimension. 

Modularization on the basis of components supports handling of large systems. 
The term “component” has been around for a long time. Component-based 
software has been a “buzzword” for about ten years, going beyond classical 
programming paradigms such as structured programming, user-defined data 
types, functional programming, object-orientation, logic programming, active 
objects and agents, distributed systems and concurrency, and middleware and 
coordination. Various component technologies have been developed since then: 
source-level language extensions (CORBA, JavaBeans); binary-level object 
models (OLE, COM, COM+, DCOM, .NET); compound documents (OLE, OpenDoc, 
BlackBox). We may generalize components to sub-schemata. 

Hierarchy abstraction enables objects to be considered at a variety of levels of 
detail. Hierarchy abstraction is based on a specific form of the general join 
operator [Tha02]. It combines types which are of high adhesion and which are mainly 
modeled on the basis of star sub-schemata. Specialization is a well-known form 
of hierarchy abstraction. For instance, an Address type is specialized to the 
GeographicAddress type. Other forms are role hierarchies and category hierarchies. 
For instance, Student is a role of Person. Undergraduate is a category of 
Student. The behavior of both is the same. Specific properties have been changed. 
Variations and versions can be modeled on the basis of hierarchy abstraction. 

Codesign [Tha00a] of database applications aims at consistent development 
of all facets of database applications: structuring of the database by schema types 
and static integrity constraints, behavior modeling by specification of function- 
ality and dynamic integrity constraints and interactivity modeling by assigning 
views to activities of actors in the corresponding dialogue steps. First, a skeleton 
of components is developed. This skeleton can be refined during evolution of the 
schema. Then, each component is developed step by step. If this component is 
associated to another component then its development must be associated with 
the development of the other component as long as their common elements are 
concerned. 

Goals of the Paper 

The paper develops a methodology for systematic development of large schemata. 
Analyzing a large number of applications it has been observed in [Tha00b] that 
large schemata have a high internal similarity. This similarity can be used to 
reason on the schema in various levels of detail. At the same time, similarity can 
be used for improvement and simplification of the schema. 

At the same time, each schema has building blocks. We call these blocks or 
cells in the schema components. These components are combined with each other. 
Schemata also have an internal many-dimensionality. Main or kernel types are 
associated with information facets such as meta-characterization, log, usage, 
rights, and quality information. The schemata have a meta-structure. 
This meta-structure is captured by the skeleton of the schema. This skeleton 
consists of the main modules without capturing the details within the types. 
The skeleton structure allows parts of the schema to be separated from others. 
The skeleton displays the structure at large. 



2 Skeletons and Components Within Database Schemata 

Component Sub-schemata 

We use the extended ER model [Tha00a] for representation of structuring and 
behavior. It has a generic algebra and logic, i.e., the algebra of derivable opera- 
tions and the fragment of (hierarchical) predicate logic may be derived from the 
HERM algebra whenever the structure of the database is given. 

A database type $\mathcal{S} = (S, O, \Sigma)$ is given by 

— a structure $S$ defined by a type expression over the set of basic types 
$B$, a set of labels $L$ and the constructors product (tuple), set and bag, i.e. 
an expression defined by the recursive type equality 

$t = B \mid t \times \dots \times t \mid \{t\} \mid [t] \mid l : t$ , 

— a set of operations $O$ defined in the ER algebra and limited to $S$, and 

— a set $\Sigma$ of (static and dynamic) integrity constraints defined in the 
hierarchical predicate logic with the base predicate $P_S$. 

Objects of the database type are $S$-structured. Classes are sets of objects 
for which the set of static integrity constraints is valid. 

Operations can be classified into “retrieval” operations, which generate 
values from the class, and “modification” operations, which change 
the objects in the class provided that static and dynamic integrity constraints 
are not invalidated. 
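
The recursive type equality for structures can be made concrete with a small sketch. It is given in Python as an illustration only: the class names Base, Product, SetOf, BagOf and Labeled, the helper depth, and the example type person are assumptions introduced here and do not come from the HERM definition.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Base:
    name: str                        # a basic type from B, e.g. "string"


@dataclass(frozen=True)
class Product:
    parts: Tuple["TypeExpr", ...]    # t x ... x t (tuple constructor)


@dataclass(frozen=True)
class SetOf:
    element: "TypeExpr"              # {t}


@dataclass(frozen=True)
class BagOf:
    element: "TypeExpr"              # [t]


@dataclass(frozen=True)
class Labeled:
    label: str                       # l : t with a label l from L
    body: "TypeExpr"


TypeExpr = Union[Base, Product, SetOf, BagOf, Labeled]


def depth(t):
    """Nesting depth of a type expression; basic types have depth 0."""
    if isinstance(t, Base):
        return 0
    if isinstance(t, Product):
        return 1 + max(depth(p) for p in t.parts)
    if isinstance(t, (SetOf, BagOf)):
        return 1 + depth(t.element)
    return depth(t.body)             # a label does not add nesting


# Assumed example: a tuple with a name and a set of phone numbers.
person = Product((
    Labeled("name", Base("string")),
    Labeled("phones", SetOf(Base("string"))),
))
```

Each alternative of the type equality corresponds to one class, so any structure built from these constructors is a well-formed type expression by construction.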

A database schema $\mathcal{D} = (\mathcal{S}_1, \dots, \mathcal{S}_m, \Sigma_G)$ is defined by 

— a list of different database types and 

— a set of global integrity constraints. 

The HERM algebra can be used to define (parameterized) views 
$\mathcal{V} = (V, O_V)$ on a schema $\mathcal{D}$ via 

— a (parameterized) algebraic expression $V$ on $\mathcal{D}$ and 
— a set $O_V$ of (parameterized) operations of the HERM algebra applicable to $V$. 

The view operations may also be classified into retrieval operations and 
modification operations, both subsets of $O_V$. Based on this classification we 
derive an output view $O^V$ of $\mathcal{V}$ and an input view $I^V$ of $\mathcal{V}$. 

In a similar way (but outside the scope of this paper) we may define 
transactions, interfaces, interactivity, recovery, etc. 

Obviously, the input view and the output view are typed based on the type 
system. Data warehouse design is mainly view design [Tha00a]. 

A database component is a database schema that has an import and an 
export interface for connecting it to other components by standardized interface 
techniques. Components are defined in a data warehouse setting. They consist of 
input elements, output elements and have a database structuring. Components 
may be considered as input-output machines that are extended by the set of all 
states $S^C$ of the database with a set of corresponding input views $I^V$ and a 
set of corresponding output views $O^V$. Input and output of components is based 
on channels $C$. The structuring is specified by $S_K$. The structuring of channels 
is described by the function $type : C \to V$ for the view schemata $V$. Views are 
used for collaboration of components with the environment via data exchange. 
In general, the input and output sets may be considered as abstract words from 
$M^*$ or as words on the database structuring. 

A database component $\mathcal{K} = (S_K, I_K^V, O_K^V, S_K^C, \Delta_K)$ is specified by 

(static) schema $S_K$ describing the database schema of $\mathcal{K}$, 

syntactic interface providing names (structures, functions) with parameters and 
database structure for $S_K^C$ and $I_K^V, O_K^V$, 

behavior relating the $I^V, O^V$ (view) channels 

$\Delta_K : (S_K^C \times (I_K^V \to M^*)) \to \mathcal{P}(S_K^C \times (O_K^V \to M^*))$. 

Components can be associated to each other. The association is restricted to 
domain-compatible input or output schemata which are free of name conflicts. 

Components $K_1 = (S_1, I_1^V, O_1^V, S_1^C, \Delta_1)$ and 
$K_2 = (S_2, I_2^V, O_2^V, S_2^C, \Delta_2)$ are free of name conflicts if their 
sets of attribute, entity and relationship type names are disjoint. 

Channels $C_1$ and $C_2$ of components $K_1 = (S_1, I_1^V, O_1^V, S_1^C, \Delta_1)$ 
and $K_2 = (S_2, I_2^V, O_2^V, S_2^C, \Delta_2)$ are called domain-compatible if 
$dom(type(C_1)) = dom(type(C_2))$. 

An output $O_1$ of the component $K_1$ is domain-compatible with an input $I_2$ 
of the component $K_2$ if $dom(type(O_1)) \subseteq dom(type(I_2))$. 
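
The compatibility conditions above can be illustrated with a minimal Python sketch. It is not part of the original formalism: the class `Channel`, the helper names, and the modeling of dom(type(C)) as a finite set of admissible values are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    """Hypothetical view channel; `domain` stands in for dom(type(C))."""
    name: str
    domain: frozenset


def channels_compatible(c1: Channel, c2: Channel) -> bool:
    """Two channels are domain-compatible if their domains coincide."""
    return c1.domain == c2.domain


def output_feeds_input(out: Channel, inp: Channel) -> bool:
    """An output is compatible with an input if its domain is contained in the input's."""
    return out.domain <= inp.domain


def free_of_name_conflicts(names1: set, names2: set) -> bool:
    """Two components are free of name conflicts if their type names are disjoint."""
    return names1.isdisjoint(names2)


# Illustrative values only.
out_channel = Channel("O1", frozenset({1, 2, 3}))
in_channel = Channel("I2", frozenset({1, 2, 3, 4}))
print(output_feeds_input(out_channel, in_channel))
print(channels_compatible(out_channel, in_channel))
```

Under these assumptions the output may be connected to the input (subset condition), although the two channels are not domain-compatible in the stricter sense of equal domains.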

Component operations such as merge, fork, transmission are definable via ap- 
plication of superposition operations [Kud82,Mal70]: Identification of channels, 
permutation of channels, renaming of channels, introduction of fictitious chan- 
nels, and parallel composition with feedback displayed in Figure 1. 

Fig. 1. The Composition of Database Components 

The star schema is the main component schema used for construction. A 
star schema for a database type $C_0$ is defined by 

— the (full) (HERM) schema $S = (C_0, C_1, ..., C_n)$ covering all types on which 
$C_0$ has been defined, 

— the subset of strong types $C_1, ..., C_k$ forming a set of keys $K_1, ..., K_s$ 
for $C_0$, i.e., $\bigcup_{i=1}^{s} K_i = \{C_1, ..., C_k\}$ and $K_i \to C_0$, 
$C_0 \to K_i$ for $1 \le i \le s$, and $card(C_0, C_i) = (1,n)$ for $1 \le i \le k$, 

— the extension types $C_{k+1}, ..., C_m$ satisfying the (general) cardinality 
constraint $card(C_0, C_j) = (0,1)$ for $k+1 \le j \le n$. 

The extension types may form their own (0, 1) specialization tree (hierarchical 
inclusion dependency set). The cardinality constraints for extension types are 
partial functional dependencies. 
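
The two cardinality conditions of the star schema can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the function name `check_star`, the encoding of card(C0, Ci) as a `(min, max)` pair, and the marker `"n"` for an unbounded maximum are hypothetical and do not come from the HERM definition itself.

```python
UNBOUNDED = "n"  # assumed encoding of the unbounded maximum in (1, n)


def check_star(card, strong, extension):
    """Return the types that violate the star-schema cardinality constraints.

    card      -- dict mapping a type name Ci to the pair card(C0, Ci)
    strong    -- names of the strong types, expected to satisfy (1, n)
    extension -- names of the extension types, expected to satisfy (0, 1)
    """
    violations = [c for c in strong if card.get(c) != (1, UNBOUNDED)]
    violations += [c for c in extension if card.get(c) != (0, 1)]
    return violations


# Illustrative schema: C4 is declared as an extension type but has card (0, n).
card = {
    "C1": (1, UNBOUNDED),
    "C2": (1, UNBOUNDED),
    "C3": (0, 1),
    "C4": (0, UNBOUNDED),
}
print(check_star(card, strong=["C1", "C2"], extension=["C3", "C4"]))
```

An empty result means that the given types satisfy both conditions; the key conditions on $K_1, ..., K_s$ are not covered by this sketch.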

There are several variants for the representation of a star schema: 

— Representation based on an entity type with attributes $C_1, ..., C_k$ and 
$C_{k+1}, ..., C_l$ and specializations forming a specialization tree 
$C_{l+1}, ..., C_n$. 

— Representation based on a relationship type $C_0$ with components 
$C_1, ..., C_k$, with attributes $C_{k+1}, ..., C_l$ and specializations forming 
a specialization tree $C_{l+1}, ..., C_n$. In this case, $C_0$ is a pivot element 
[BiP00] in the schema. 

— Representation based on a hybrid form combining the two above. 

The schema in Figure 2 is based on the first representation option. It shows 
the different facets of product characterizations. 

Thus, a star component schema is usually characterized by a kernel entity 
type used for storing basic data, by a number of dimensions that are usually 
based on subtypes of the entity type such as Service and Item, and on subtypes 
which are used for additional properties such as AdditionalCharacteristics and 
ProductSpecificCharacteristics. These additional properties are clustered accord- 
ing to their occurrence for the things under consideration. Furthermore, products 
are classified by a set of categories. Finally, products may have their life and us- 
age cycle, e.g., versions. Therefore, we observe that the star schema is in our 
case a schema with four dimensions: subtypes, additional characterization, life 
cycle and categorization. 

Star schemata may be extended to snowflake schemata. Database theory folklore 
uses star structures on the basis of α-acyclic hypergraphs [Tha91,Yu092]. 
Snowflake structuring of objects can be caused by the internal structure of 
functional dependencies. If, for instance, the dependency graph for functional 
dependencies forms a tree then we may decompose the type into a snowflake using 
the functional dependency $X \to Y$ for binary relationship types $R$ on $X, Y$ 
with $card(R,X) = (1,1)$ and $card(R,Y) = (1,n)$. 

Fig. 2. Star Component Schema Product-Intext_Data for the Product Application 



A snowflake schema is a 

— star schema S on $C_0$ extended or changed by 

• variations S* of the star schema (with renaming), 

• strong 1-n-composition by association (glue) types associating 
the star schema with another star schema S', either with full composition 
restricted by the cardinality constraint card(Ag, S) = (1,1) or with 
weak, referencing composition restricted by card(Ag, S) = (0,1), 

— whose structure is potentially $C_0$-acyclic. 

A schema S with a ‘central’ type $C_0$ is called potentially $C_0$-acyclic if all 
paths p, p' from the central type to any other type $C_k$ are 

— either entirely different on the database, i.e., the exclusion dependency 
$p[C_0, C_k] \,\|\, p'[C_0, C_k]$ is valid in the schema, 

— or completely identical, i.e., the pairing inclusion constraints 
$p[C_0, C_k] \subseteq p'[C_0, C_k]$ and $p[C_0, C_k] \supseteq p'[C_0, C_k]$ are valid. 

The exclusion constraints allow us to form a tree by renaming the non-identical 
types. In this case, the paths carry different meanings. The pairing inclusion 
constraints allow us to cut the last association in the second path, thus 
obtaining an equivalent schema, or to introduce a mirror type $C_k'$ for the 
second path. In this case, the paths carry identical meaning. 
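
On a concrete database state, the condition of being potentially acyclic can be tested path by path. The Python sketch below is an illustration, not the author's procedure: it assumes that each path is already evaluated to its projection on the central type and the target type, given as a set of object pairs, and it reads the exclusion dependency as disjointness of these projections. The function name and the sample data are hypothetical.

```python
from itertools import combinations


def potentially_acyclic(path_projections):
    """Check that any two paths to the same type are disjoint or identical.

    path_projections -- dict mapping a target type name to a list of sets;
                        each set holds the (central object, target object)
                        pairs obtained along one path.
    """
    for projections in path_projections.values():
        for p, q in combinations(projections, 2):
            if not (p.isdisjoint(q) or p == q):
                return False
    return True


# Two paths to "Address" that never agree (exclusion dependency holds).
ok_state = {"Address": [{("o1", "a1")}, {("o1", "a2")}]}
# Two paths that overlap only partially: neither excluded nor identical.
bad_state = {"Address": [{("o1", "a1"), ("o2", "a2")}, {("o1", "a1")}]}

print(potentially_acyclic(ok_state))
print(potentially_acyclic(bad_state))
```

A check on one state can only refute the property; validity of the constraints in the schema has to hold for all states.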



Skeleton of a Schema 

Skeletons form a framework and survey the general architecture or plan of an 
application to which details such as the types are to be added. Plans show how 
potential types are associated with each other in general. 

The skeleton is defined by units and their associations: 

Units and components: The basic element of a unit is a component. A set of 
components forms a unit if this set can be semantically separated from all 
other components without losing application information. Units may contain 
entity, relationship and cluster types. The types have a certain affinity 
or adhesion to each other. 

Units are graphically represented by rounded boxes. 

Associations of units: Units may be associated to each other in a variety of ways. 
This variety reflects the general associations within an application. Asso- 
ciations group the relation of units by their meaning. Therefore different 
associations may exist between the same units. 

Associations can also relate associations to units or to other associations. 
Therefore, we use an inductive structuring similar to the extended 
entity-relationship model [Tha00a]. 

Associations are graphically represented by double rounded boxes. 

The skeleton is based on the repulsion of types. The measure for repulsion can 
be based on natural numbers with zero. A zero repulsion is used for relating 
weak types to their strong types. The repulsion measure $r(x,y)$ is a norm in the 
mathematical sense, i.e., $r(x,y) \ge 0$, $r(x,x) = 0$, $r(x,y) \le r(x,z) + r(z,y)$. 
The repulsion measure allows us to build $t$-shells $\{ T' \mid r(T,T') \le t \}$ 
around the type T. 
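
The shell construction can be sketched in a few lines of Python. This is an illustration under stated assumptions: the repulsion values in `_REPULSION`, the symmetric lookup, and the function names are hypothetical; only the three conditions on the measure and the definition of a shell are taken from the text.

```python
from itertools import product

TYPES = ["Person", "Address", "Product"]

# Assumed, symmetric repulsion values; 0 relates a weak type to its strong type.
_REPULSION = {
    frozenset({"Person", "Address"}): 0,
    frozenset({"Person", "Product"}): 3,
    frozenset({"Address", "Product"}): 3,
}


def r(x, y):
    """Repulsion between two types (illustrative values only)."""
    return 0 if x == y else _REPULSION[frozenset({x, y})]


def is_repulsion_measure(types, measure):
    """Check non-negativity, r(x, x) = 0 and the triangle inequality."""
    for x, y, z in product(types, repeat=3):
        if measure(x, y) < 0 or measure(x, x) != 0:
            return False
        if measure(x, y) > measure(x, z) + measure(z, y):
            return False
    return True


def shell(center, t, types, measure):
    """All types whose repulsion from `center` is at most t."""
    return {T for T in types if measure(center, T) <= t}


print(is_repulsion_measure(TYPES, r))
print(sorted(shell("Person", 0, TYPES, r)))
```

With these values the 0-shell around Person contains Person and Address, so the two types would fall into the same unit, while Product only enters once the bound reaches 3.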

Repulsion is a semantic measure that depends on the application area. It 
allows us to separate types depending on the application. We may enhance the 
repulsion measure with application points of view, i.e., introduce a measure 
$r_V(T,T')$ for types T, T' and views V from $\mathcal{V}$. 

These views form the associations of the units. Associations are used 
for relating units or parts of them to each other. Associations often represent 
specific facets of an application such as points of view, application areas, 
and workflows that can be separated from each other. 

Let us consider a database schema used for recording information on 
production processes: 

— the Party unit with components such as Person, Organization, the variety of 
their subtypes, address information, and data on their profiles, their 
relationship, etc., 
— the Work unit with components Product, specializations of Product such as 
Good and Service, Consumption of elements in the production process, and the 
WorkEffort component which enables describing the production process, 
— the Asset unit consisting of only one component Asset with subtypes such as 
Property, Vehicle, Equipment, and Other Asset for all other kinds of assets, and 
— the Invoice unit which combines components Invoicing with InvoiceLineItem, 
which lists work tasks and work efforts with the cost and billing model, Banking, 
and Tracking. 

These units have a number of associations among them: 

— the WorkTask-WorkAssignment association combines facets of work such as 
PartyAllocation, PartyWorkTaskAssignment, TrackOfWorkTasks, and 
ControllingOfWorkTasks, 
— the PartyAssetAssignment association combines the units Party and Asset, 
— the Billing association is a relationship among Work and Invoices and has 
components such as Tracking, Controlling, and Archieving, 
— the AssetAssignment association allows keeping track of the utilization of 
fixed assets in the production process and thus associates Work and Asset. 

The four main units can be surveyed in a form displayed in Figure 3. We assume 
that billing is mainly based on the work effort. Work consumes assets and fixed 
assets. 

[Figure: skeleton diagram; recoverable association labels include Fixed Asset 
Assignment, Inventory Assignment, Inventory Requirement, Party Assignment, and 
Party Asset Assignment] 

Fig. 3. Skeleton of a Schema Supporting Production Applications 



The full schema for the Production application has been discussed in 
[Tha00b]. It generalizes schemata which have been developed in production 
applications and consists of more than 150 entity and relationship types with 
about 650 attribute types. 



Meta-characterization of Components, Units, and Associations 

The skeleton information is kept by meta-characterization information that 
allows keeping track of the purpose and the usage of the components, units, 
and associations. Meta-characterization can be specified on the basis of dockets 
[ScS99] that provide information: 

— on the content (abstracts or summaries), 

— on the delivery instruction, 

— on the parameters of functions for treatment of the unit (opening with(out) 
zooming, breath, size, activation modus for multimedia components etc.) 

— on the tight association to other units (versions, releases etc.), 

— on the meta-information such as resources, restriction, copyright, roles, dis- 
tribution policy etc. 

— on the content providers, content reviewers and review evaluators with qual- 
ity control policies, 

— on applicable workflows and the current status of completion and 

— on the log information that enables tracing the object’s life cycle. 



Dockets can be extended to general descriptions of the utilization. The following 
definition frame is appropriate which classifies meta-information into mandatory, 
good practice, optional and useful information: 



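                 | mandatory          | good practice        | optional              | useful
-----------------+--------------------+----------------------+-----------------------+----------------------
header           | content            | name                 | developer             | copyright
                 | problem area       |                      | motivation            | source
                 | solution           | intention            | also known as         | see too
                 | variants           |                      |                       |
-----------------+--------------------+----------------------+-----------------------+----------------------
application area | application        | applicability        | consequences of       | sample applications
                 |                    |                      | application           |
                 | known applications | usability profile    | experience reports    | DBMS
-----------------+--------------------+----------------------+-----------------------+----------------------
description      | structuring:       | functionality:       | interactivity:        | context:
                 | structure, static  | operations, dynamic  | story space, actors,  | tasks, intention,
                 | constraints        | constraints,         | media objects,        | history, environment,
                 |                    | enforcement          | representation        | particular
                 |                    | procedures           |                       |
-----------------+--------------------+----------------------+-----------------------+----------------------
implementation   | implementation     | code sample          | associated framework  |
                 | associated unit    | collaboration        | integration strategy  |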



The frame follows the codesign approach [Tha00a] with the integrated design 
of structuring, functionality, interactivity and context. The frame is structured 
into general information provided by the header, application characterization, 
the content of the unit and documentation of the implementation. 



3 Composition of Schemata Based on Components 

Composition by Constructors 

We distinguish three main methods for composition of components: construction 
by combination or association on the basis of constructors, construction by 
folding or combining schemata into more general schemata, and construction by 
categorization. It is not surprising that these methods are based on principles 
of component abstraction [SmS77]. Composition is based on the skeleton of the 
application and uses a number of composition constructors. 



Constructor-Based Composition: Star and snowflake schemata may be 
composed by composition operations such as product, nest, disjoint union, 
difference and set operators. These operators allow us to construct any schema 
of interest since they are complete for sets. We prefer, however, a more 
structural approach following [Bro00]. Therefore, all constructors known for 
database schemata may be applied to schema construction. 



Bulk Composition: Types used in schemata in a very similar way can be 
clustered together on the basis of a classification. Let us exemplify this 
generalization approach for Ordering processes. The types PlacedBy, TakenBy, and 
BilledTo in Figure 4 are similar. They associate orders with both PartyAddress 
and PartyContactMechanism. They are used together and on the same objects, i.e., 
each order object is at the same time associated with one party address and one 
party contact mechanism. Thus, we can combine the three relationship types into 
the type OrderAssociation. The type OrderAssociationClassifier allows us to 
derive the three relationship types. The domains dom(ContractionDomain) = 
{PlacedBy, TakenBy, BilledTo} and dom(ContractionBinder) = ∀ can be used to 
extract the three relationship types. 



Fig. 4. The Bulk Composition Within The Order Application 

Handling of classes which are bound by the same behavior and occurrence 
can be simplified by this construct. In general, the composition by folding may 
be described as follows: 

Given a CentralType C and associated types which are associated by a set of 
relationship types $\{A_1, ..., A_n\}$ by the occurrence frame F. The occurrence 
frame can be ∀ (if the inclusion constraints $A_i[C] \subseteq A_j[C]$ are valid 
for all $1 \le j \le n$) or a set of inclusion constraints. Now we combine the 
types $\{A_1, ..., A_n\}$ into the type BulkType with the additional component 
ContractionAssistant and the attributes Identif (used for identification of 
objects in the type ContractionAssistant if necessary), ContractionDomain with 
dom(ContractionDomain) = $\{A_1, ..., A_n\}$ and dom(ContractionBinder) = F. 
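
The folding step can be sketched in Python as follows. This is only an illustration under stated assumptions: relationship types are modeled as sets of tuples, the ContractionDomain value is stored as the first tuple position, and the function names `fold` and `extract` as well as the sample rows are hypothetical.

```python
def fold(relations):
    """Combine several similar relationship types into one bulk type.

    relations -- dict mapping a relationship type name to its set of tuples.
    Each tuple of the result starts with the name of the originating type,
    which plays the role of the ContractionDomain attribute.
    """
    return {(name, *row) for name, rows in relations.items() for row in rows}


def extract(bulk, name):
    """Recover one of the original relationship types from the bulk type."""
    return {row[1:] for row in bulk if row[0] == name}


# Illustrative rows: (order, party address, party contact mechanism).
relations = {
    "PlacedBy": {("order1", "addr1", "phone1")},
    "TakenBy": {("order1", "addr2", "phone2")},
    "BilledTo": {("order1", "addr1", "mail1")},
}

order_association = fold(relations)
print(len(order_association))
print(extract(order_association, "BilledTo"))
```

Extraction is the inverse of folding in this sketch, so no information is lost by replacing the three relationship types with the single bulk type.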



Architecture Composition: Categorization-based composition has been 
widely used for complex structuring. The architecture of SAP R/3 often has 
been displayed in the form of a waffle. For this reason, we prefer to call this 
composition waffle composition or architecture composition. Architecture 
composition enables association through categorization and 
compartmentalization. This constructor is especially useful during modeling of 
distributed systems with local components and with local behavior. There are 
specific solutions for interface management, replication, encapsulation and 
inheritance. The cell construction is the main constructor in component 
applications and in data warehouse applications. Therefore, composition by 
categorization is the main composition approach used in component-based 
development and in data warehouse approaches. 



Lifespan Composition 

Evolution of things in an application is an orthogonal dimension which must be 
represented in the schema on the one hand but which should not be mixed with 
the constructors on the other hand. We observe a number of lifespan 
compositions: evolution composition records the stages of the life of things 
and/or their corresponding objects and is closely related to workflows, 
circulation composition displays the phases in the lifespan of things, 
incremental composition allows recording the development and specifically the 
enhancement of objects, their aging and their own lifespan, loop composition 
supports chaining and scaling to different perspectives of objects which seem 
to rotate in the workflow, and network composition allows the flexible 
treatment of objects during their evolution, supports passing objects along a 
variety of evolution paths and enables multi-object collaboration. 



Evolution Composition: Evolution composition allows us to construct a 
well-communicating set of types with a point-to-point data exchange among the 
associated types; the flow constructor serves this purpose. Such evolution 
(flow) associations often appear in workflow applications, data flow 
applications, business processes, customer scenarios and during identification 
of variances. 

Evolution is based on the specific treatment of stages of objects. Things are 
passed to teams which work on their further development. This workflow is 
well-specified. During evolution, things obtain a number of specific properties. 
The record of the evolution is based on evolution composition of object classes. 
Therefore, we define a specific composition in order to support the modeling, 
management and storage of evolution. 



Circulation Composition: Things may be related to each other by life cycle 
stages such as repetition, evolution, self-reinforcement and self-correction. Typ- 
ical examples are objects representing iterative processes, recurring phenomena 
or time-dependent activities. 

Circulation composition allows displaying objects in different phases. For 
instance, paper handling in a conference paper submission system is based 
on such phases: PaperAbstract, SubmittedPaper, PaperInReviewing, 
AcceptedPaper, RejectedPaper, FinalVersionPaper, ScheduledPaperPresentation, and 
ArchievedPaperVersion. The circulation model is supported by specific 
phase-based dynamic semantics [Tha00a]. Circulation thus forms an iterative 
process. 



Incremental Composition: Incremental composition enables the production 
of new associations based on a core object. It is based on containment, sharing 
of common properties or resources and alternatives. Typical examples are found 
in applications in which processes generate multiple outcomes, collect a range of 
inputs, create multiple designs or manage inputs and outputs. 

Incremental development enables building layers of a system, environment, 
or application and thus the management of system complexity. Incremental 
constructions may be based on intervals and may appear with a frequency and 
modulation. They are mainly oriented towards transport of data. Typical 
applications of the incremental constructor lead to the n-tier architecture and 
to versioning of objects. Furthermore, cooperation and synergy are supported. 
Typical incremental constructions appear in areas such as facility management. 
For instance, incremental database schemata are used at various stages of the 
architectural, building and maintenance phases in construction engineering. 

A specialized incremental constructor is the layer constructor that is widely 
used in frameworks, e.g., the OSI framework for communicating processes. 
Incremental lifespan modeling is very common and met in almost all large 
applications. For instance, the schema^1 displayed in Figure 5 uses a specific 
composition frame. The type Request is based on the type Quote. Requests may be 
taken on their own. They are, however, coupled to quotes in order to be sent to 
suppliers. Thus, we have a specific variant of quote objects. The same 
observation can be made for types such as Order, Requisition, and Response. 

^1 We use the extended ER model that allows displaying subtypes on the basis of 
unary relationship types and thus simplifies representation. 

Fig. 5. Incremental Composition in the Order Schema 



Loop Composition: Loop composition is applied whenever the lifespan of 
objects is cyclic or looping. It is applied for the representation of objects that 
store chaining of events, people, devices or products. The loop composition is 
a companion of the circulation composition since it composes non-directional, 
non-hierarchical associations. Different modes of connectivity may be applied. 
Loops may be centralized and decentralized. 

Loop composition models chaining and change of perspectives of events, peo- 
ple, devices, and other things. Loops are usually non-directional and cyclic. Tem- 
poral assignment and sharing of resources and its record, temporal association 
and integration, temporal rights, roles and responsibilities can be neatly repre- 
sented and scaled by loop composition as long as the association is describable. 



Network Composition: Network or web composition enables collecting a 
network of associated types into a multipoint web of associated types with 
specific control and data association strategies. The web has a specific data 
update mechanism, a specific data routing mechanism and a number of communities 
of users building their views on the web. 

Networks evolve quickly. They usually have an irregular growth, are 
built in an opportunistic manner, are rebuilt and renewed and must carry a 
large number of variations. Network composition enables growth control and 
change management. Usually, networks are supported by a multi-point center of 
connections, by controlled routing and replication, by change protocols, by 
controlled assignment and transfer, scoping and localization abstraction, and 
trader architecture. Further, export/import converters and wrappers are 
supported. The database farm architecture [Tha01] with check-in and check-out 
facilities supports flexible network extension. 



Context Composition 

According to [Wis01] we distinguish between the intext and the context of things 
which are represented by objects. Intext reflects the internal structuring, 
associations among types and sub-schemata, the storage structuring and the 
representation options. Context reflects general characterizations, 
categorization, utilization, and general descriptions such as quality. 
Therefore, we distinguish between meta-characterization composition, which is 
usually orthogonal to the intext structuring and can be added to each of the 
intext types, utilization-recording composition, which is used to trace the 
running of the database engine and to restore an older state or to reason on 
previous steps, and quality composition, which allows reasoning on the quality of 
the data provided and applying summarization and aggregation functions in a form 
that is consistent with the quality of the data. The dimensionality [FeT02] 
inside schemata allows extracting other context compositions. We concentrate, 
however, on the main compositions. All these compositions are orthogonal to the 
other compositions, i.e., they can be associated to any of the compositions. 



Meta-Characterization Composition: The meta-characterization is an or- 
thogonal dimension applicable to a large number of types in the schema. Such 
characterizations in the schema in Figure 2 include language characteristics and 
utilization frames for presentation, printing and communication. Other orthog- 
onal meta-characterizations are insertion/update/deletion time, keyword char- 
acterization, utilization pattern, format descriptions, utilization restrictions and 
rights such as copyright and costs, and technical restrictions. 
Meta-characterizations apply to a large number of types and should be factored 
out. For instance, in e-learning environments e-learning objects, elements and 
scenes are commonly characterized by educational information such as 
interactivity type, learning resource type, interactivity level, age 
restrictions, semantic density, intended end user role, context, difficulty, 
utilization interval restrictions, and pedagogical and didactical parameters. 



Utilization-Recording Composition: Log, usage and history composition is 
commonly used for recording the lifespan of the database. We distinguish 
between history composition used for storing and recording the log computation 
history in a small time slice, usage scene composition used to associate data 
with their use in business processes at a certain stage, workflow step, or scene 
in an application story, and structures used to record the actual usage. 

Data in the database may depend directly on one or more aspects of time. 
We distinguish three orthogonal concepts of time: temporal data types such as 
instants, intervals or periods, kinds of time, and temporal statements such as 
current (now), sequenced (at each instant of time) and nonsequenced (ignoring 
time). Kinds of time are: transaction time, user-defined time, validity time, and 
availability time. 



Quality Composition: Data quality is modeled by a variety of compositions. 
Data quality is essential whenever we need to distinguish versions of data based 
on their quality and reliability: source dimension (data source, user responsible 
for the data, business process, source restrictions), intrinsic data quality 
parameters (accuracy, objectivity, believability, reputation), accessibility data 
quality (accessibility, access security), contextual data quality (relevancy, 
value-added, timeliness, completeness, amount of information), and 
representational data quality (interpretability, ease of understanding, concise 
representation, consistent representation, ease of manipulation). 

4 Conclusion 

Huge databases are usually developed over years, have schemata that carry 
hundreds of types and require high abilities in reading such diagrams or 
schemata. We observe, however, a large number of similarities, repetitions, 
and, last but not least, similar structuring in such schemata. This paper 
aims at extracting the main meta-similarities in schemata. These similarities 
are based on components which are either kernel components such as star 
and snowflake structures, or are built by application of compositions such as 
association, architecture, bulk or constructor composition, or lifespan 
composition such as evolution, circulation, incremental, loop, and network 
compositions, or are context compositions such as meta-characterization, 
utilization or deployment and quality compositions. 

Therefore, we can use this meta-structuring of schemata for modularization 
of schemata. Modularization eases querying, searching, reconfiguration, 
maintenance, integration and extension. Further, re-engineering and reuse become 
feasible. 

Modeling based on meta-structures enables systematic schema development, 
systematic extension and systematic implementation, and thus allows keeping 
consistency in a much simpler and more comprehensible form. We claim that such 
structures can already be observed in small schemata. They are, however, very 
common in large schemata due to the reuse of design ideas, due to the design 
skills and to the inherent similarity in applications. 

Meta-structuring also enables component-based development. Schemata can 
be developed step-by-step on the basis of the skeleton of the meta-structuring. 
The skeleton consists of units and associations of units. Such associations 
combine relationship types among the types of different units. Units form 
components within the schema. 

Component-based development enables industrial development of 
database applications instead of handicraft developments. Handicraft 
developments later cause infeasible integration problems and lead to unbendable, 
intractable and incomprehensible database schemata of overwhelming complexity 
which cannot be consistently maintained. We observe, however, that the computer 
game industry is producing games in a manufactured, component-based fashion. 
This paper shows that database systems can be produced in a similar form. 



Abstract. According to Cognitive Load Theory (CLT), presenting information 
in a way that cognitive load falls within the limitations of working memory can 
improve speed and accuracy of understanding, and facilitate deep understanding 
of information content. This paper describes a laboratory experiment which in- 
vestigates the effects of reducing cognitive load on end user understanding of 
conceptual models. Participants were all naive users, and were given a data 
model consisting of almost a hundred entities, which corresponds to the aver- 
age-sized data model encountered in practice. One group was given the model 
in standard Entity Relationship (ER) form and the other was given the same 
model organised into cognitively manageable “chunks”. The reduced cognitive 
load representation was found to improve comprehension and verification accu- 
racy by more than 50%, though conflicting results were found for time taken. 
The practical significance of this research is that it shows that managing cogni- 
tive load can improve end user understanding of conceptual models, which will 
help reduce requirements errors. The theoretical significance is that it provides 
a theoretical insight into the effects of complexity on understanding of concep- 
tual models, which have previously been unexplored. The research findings 
have important design implications for all conceptual modelling notations. 



1 Introduction 

1.1 Cognitive “Bandwidth” and Cognitive Overload 

One of the most pervasive characteristics of life in technologically advanced societies 
is the growing prevalence of sensory and information overload [31]. The human or- 
ganism can be viewed as an information processing system [39]. Due to limits on 
working memory, this system has a strictly limited processing capacity, which is esti- 
mated to be “the magical number seven, plus or minus two” concepts at a time [3, 34]. 
This has been described as the “inelastic limit of human capacity” [56], “human 
channel capacity” [34] or “cognitive bandwidth” [35], and represents one of the 
enduring laws of human cognition [3]. Working memory is used for active 
processing and transient storage of information, and plays an important role in 
comprehension [2]. When the stimulus input exceeds working memory capacity, a 
state of information or cognitive overload ensues and comprehension degrades 
rapidly [14, 23]. 



1.2 End User Understanding of Conceptual Models 

The understandability of conceptual models is of critical importance in IS 
development. Conceptual models must be readily comprehensible so that they can 
be understood by all stakeholders, particularly end users [40]. If end users 
cannot effectively 
understand the conceptual model, they will be unable to verify whether it meets their 
requirements [30]. If the model does not accurately reflect their requirements, the 
system that is delivered will not satisfy users, no matter how well designed or imple- 
mented it is [36]. The large number of systems that are delivered which do not meet 
user requirements suggests that end users have significant difficulties understanding 
and verifying conceptual models [e.g. 7, 15, 47, 48, 58]. Empirical studies show that 
more than half the errors which occur during systems development are the result of 
inaccurate or incomplete requirements [29, 33]. Requirements errors are also the most 
common reason for failure of systems development projects [47, 48]. This suggests 
that improving understanding of conceptual models should be a priority area for re- 
search. 



1.3 The Entity Relationship (ER) Model 

The Entity Relationship (ER) Model [9] is the international standard technique for 
data modelling, and has been used to design database schemas for over two decades 
[55]. Despite the emergence of object oriented (OO) analysis techniques and in particular
UML, it remains the most popular method for defining information requirements
in practice [43, 46, 53]. A recent survey of practice showed that it was not only
the most commonly used data modelling technique, but also the most commonly used 
IS analysis technique generally [12]. One of the widely quoted advantages of the ER 
Model is its ability to communicate with end users [e.g. 6, 10, 17, 33]. However field 
studies show that in practice, ER models are poorly understood by users, and in most 
cases are not developed with direct user involvement [21, 22]. Experimental studies 
also show that comprehension of data models is very poor and that a large percentage 
of data model components are not understood [40]. While ER modelling has proven 
very effective as a method for database design (as evidenced by its high level of adop- 
tion in practice), it has been far less effective for communication with users [19, 44]. 



1.4 Complexity Effects on Understanding of ER Models 

One of the most serious practical limitations of the ER model is its inability to cope 
with complexity [1, 16, 45, 54, 59, 60]. Neither the standard ER Model nor the Extended
Entity Relationship (EER) model provides suitable abstraction mechanisms for
managing complexity [60]. In the absence of such mechanisms, ER models are typi- 
cally represented as single interconnected diagrams, often consisting of more than a 
hundred entities. Surveys of practice show that application data models consist of an 
average of 95 entities, while enterprise data models consist of an average of 536 entities
[32]. Such models exceed human cognitive capacity by a factor of ten or more,
which provides a possible explanation for why ER models are so poorly understood in 
practice. 



1.5 Cognitive Load Theory 

Cognitive Load Theory (CLT) is an internationally known and widely accepted the- 
ory, which has been empirically validated in numerous studies [5]. It is based on theo- 
ries of human cognitive architecture, and provides guidelines for presenting informa- 
tion in a way that optimises human understanding by reducing load on working mem- 
ory [27]. CLT simultaneously considers the structure of information and the cognitive 
architecture that people use to process the information [42]. Cognitive load is the 
amount of mental activity imposed on working memory at a point in time, and is 
primarily determined by the number of elements that need to be attended to [49, 50]. 
When cognitive load exceeds the limits of working memory, understanding is re- 
duced. Cognitive load consists of two aspects: 

• Intrinsic cognitive load (ICL), which is due to the inherent difficulty of the infor- 
mation content. This derives from the complexity of the underlying domain and 
cannot be reduced. 

• Extraneous cognitive load (ECL), which is due to the way the information is pre- 
sented. By changing how the information is presented, the level of cognitive load 
may be reduced, which can be used to improve understanding of subject matter. 

Manipulating extraneous cognitive load so that it falls within the limitations of work- 
ing memory has been found to improve both speed and accuracy of understanding [8, 
37], and to facilitate deep understanding of information content [41]. However the 
effectiveness of this depends on the expertise of the audience. People with higher 
levels of expertise can handle larger amounts of information because they have more 
highly developed knowledge schemas, and require fewer elements of working mem- 
ory to understand the same amount of information [49, 50]. Reducing cognitive load 
is likely to have little or no effect on their understanding, and may even have a nega- 
tive effect: this is called the expertise reversal effect [24]. Novices need to attend to 
all elements individually, so will be significantly affected by how information is pre- 
sented. 

The implications of this for representing large ER models are: 

• The root cause of problems in understanding large ER models is not their size per 
se (ICL), which is a function of the underlying domain and cannot be reduced, but 
how they are presented (ECL). 

• Understanding can be improved by manipulating extraneous cognitive load so that 
it falls within the limitations of working memory. 

• End users will be more affected by cognitive overload than analysts: this may help 
to explain why ER modelling has been highly effective for database design but less 
effective for communication with users. 



1.6 Levelled Data Models

To address problems in understanding large and complex ER models, a method was 
developed for representing ER models in a way that reduces cognitive load to within 
the limitations of working memory. This is done by dividing the model into a set of 
cognitively manageable subsystems, called subject areas. Each subject area represents 
a subset of the ER model that is small enough to be understood by the human mind. 
The resulting representation is called a Levelled Data Model (LDM), and consists of 
three primary components (Fig. 1): 

• Context Data Model: This provides an overview of the model and how it is divided 
into subject areas. This is shown in pictorial form, with each subject area repre- 
sented by a graphical image. 

• Subject Area Data Models: These show a subset of the model in detail. These are 
represented as standard ER models, with foreign entities used to show relationships
to entities on other subject areas. Subject areas are limited to “seven, plus or minus 
two” entities, to ensure that working memory is not overloaded. 

• Entity Index: this lists all entities alphabetically with their subject area reference, 
and is used to help locate specific entities within the model. 



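[Fig. 1: the three primary components of a Levelled Data Model; only the label “Subject Area Data Models” survives from the figure]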




The model may be organised into any number of levels, depending on the size of 
the underlying data model. This results in a hierarchy of diagrams at different levels 
of abstraction. The “seven, plus or minus two” principle is applied at each level to 
ensure that marginal cognitive load (the cognitive load imposed by each diagram) is 
within the limitations of working memory. 
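
The levelling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: entities are sliced in list order into groups of at most nine (the upper bound of “seven, plus or minus two”), whereas the LDM method groups entities by meaning, and all function and variable names here are hypothetical.

```python
MAX_CHUNK = 9  # upper bound of "seven, plus or minus two"


def chunk(items, max_size=MAX_CHUNK):
    """Split items into consecutive groups of at most max_size elements."""
    return [items[i:i + max_size] for i in range(0, len(items), max_size)]


def build_levels(entities, max_size=MAX_CHUNK):
    """Group entities repeatedly until the top level has at most max_size groups.

    Returns a list of levels: levels[0] holds the subject areas (lists of
    entity names) and each later level groups the chunks of the level below.
    """
    levels = [chunk(entities, max_size)]
    while len(levels[-1]) > max_size:
        levels.append(chunk(levels[-1], max_size))
    return levels


def entity_index(subject_areas):
    """Map each entity name to the number of the subject area that holds it."""
    return {
        name: area_no
        for area_no, area in enumerate(subject_areas, start=1)
        for name in area
    }


# Hypothetical model with 98 placeholder entities (the size of the
# experimental data model described in Section 2.3).
entities = [f"Entity{n:03d}" for n in range(1, 99)]
levels = build_levels(entities)
index = entity_index(levels[0])

print(len(levels[0]), "subject areas;", len(levels), "levels of grouping")
print("Entity042 is in subject area", index["Entity042"])
```

With 98 entities the first pass yields more than nine subject areas, so a second level of grouping is needed before the top-level diagram fits within the same limit.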



Subject Areas as a “Chunking” Mechanism 

A wide range of experimental studies have shown that humans organise items into 
logical groups or “chunks” in order to expand working memory capacity [e.g. 3, 11, 
34, 38, 39]. The process of recursively developing information saturated chunks is the 
primary mechanism used by the human mind to deal with complexity in everyday 
life [18]. The ER model already provides one level of chunking, by allowing attributes
to be grouped into entities. However this is inadequate in situations of real world
complexity. A higher level chunking mechanism is needed once the number of enti- 
ties exceeds human cognitive capacity (which is almost always the case in practice 
[32]). The LDM method provides a way of recursively grouping entities into higher 
level chunks (subject areas), which provides a general-purpose complexity management
mechanism capable of handling any level of complexity. The size of
these chunks is matched to the processing capacity of the human mind, which ensures
that working memory is not overloaded.

According to cognitive load theory, the grain size of information should be 
matched to the processing ability of the intended audience [52, 57]. ER models are 
intended for communicating with non-experts (end users), which means that the grain 
size (which in this case corresponds to the number of elements per diagram) should 
not exceed seven, plus or minus two. Expert modellers (analysts) will be able to handle
much more complex diagrams because they can use their internal knowledge schemas 
to group elements together, but end users will need to attend to all elements individu- 
ally. 



Cognitive Integration: Understanding Multiple Diagrams 

Diagrams facilitate problem solving by showing all relevant information in a single 
place [20, 28]. Dividing an ER model into subject areas reduces the number of entities 
which must be understood simultaneously, thus reducing cognitive load. However 
because information is distributed across multiple diagrams, it reduces understanding 
of the problem as a whole [61]. Kim et al [25] argue that for information from multi- 
ple diagrams to be integrated in the problem solver’s mind, the notation must explic- 
itly support perceptual integration and conceptual integration processes. The LDM 
representation supports these processes in the following ways: 

• Perceptual integration: Foreign entities provide visual cues to show how elements 
on an SADM are related to elements on all other SADMs. This supports perceptual 
integration by aiding navigation between related diagrams. 

• Conceptual integration: The Context Data Model provides an overview of the sys- 
tem in a single diagram, and thus represents what Wood and Watts [62] call a 
“longshot” diagram. This allows the user to integrate information from multiple 
diagrams into a coherent representation of the system. 



1.7 Research Question Addressed 

This paper describes a laboratory experiment which evaluates the effectiveness of 
reducing cognitive load on end user understanding of large ER models. The broad 
research question addressed is: 

• Are data models represented using the LDM method more easily understood by 
end users than models represented in standard ER form? 

Based on cognitive load theory, we predict that the LDM representation will improve 
both speed and accuracy of understanding. 



2 Research Design

2.1 Experimental Design 

A two group, post-test only design was used, with one active between-groups factor 
(Representation Method). The experimental groups consisted of a control group (us- 
ing the standard ER representation) and a treatment group (using the LDM representa- 
tion). Participants were trained in the conventions of one of the methods (experimen- 
tal treatment), and were then given an experimental data model represented using this 
method. They were then given a set of questions to answer about the model (comprehension
task) and a description of user requirements against which they were asked to
verify the model (verification task).



2.2 Participants 

There were 29 participants in the experiment, all of whom were first year Accounting 
students. Subjects had no prior experience in the use of data modelling techniques 
(which was a condition of selection for the experiment), so were considered as prox- 
ies for naive users. All participated voluntarily in the experiment and were paid $25 
on completion. Subjects were randomly assigned to experimental groups. 



2.3 Materials 

Experimental Data Models 

The experimental data model used in this experiment was a data model taken from 
practice, defining the requirements for a customer management and billing system in 
a utility company. It consisted of 98 entities (109 including subtypes) and 480 attrib- 
utes, so was very close to the average size for an application data model. As well as 
the diagrams, a Data Dictionary was supplied with entity and attribute definitions. 
Two different representations of the data model were prepared for use in the experi- 
ment, one in standard ER form and one using the LDM method. The two sets of ex- 
perimental materials contained exactly the same requirements information but were 
presented differently. In other words, they were informationally equivalent, but not 
necessarily computationally equivalent [28]. 



Comprehension Task 

A set of 25 True/False questions was used to test comprehension performance. Par- 
ticipants were required to answer these questions in the comprehension task. 



Verification Task 

A one page textual description of requirements was used to test verification perform- 
ance. Participants were required to identify discrepancies between the stated require- 
ments and the data model in the verification task. There were 15 discrepancies be- 
tween the data model and the set of requirements. 



2.4 Independent Variable

The independent variable is the method used to represent the experimental data model 
(Representation Method). This has two levels, corresponding to the different 
representation methods being evaluated: 

• Standard ER representation (control): this represents the model as a single diagram. 

• LDM representation (treatment): this is the cognitive-load reduced representation, 
where each diagram is of cognitively manageable size. 

The independent variable was operationalised through the experimental treatment 
(training) and the experimental data models. 



2.5 Dependent Variables 

User validation of data models consists of two separate cognitive processes: compre- 
hension and verification [26]. Users must first comprehend the meaning of the model, 
then they must verify the model by identifying any discrepancies between the model 
and their (tacit) knowledge of their requirements. Comprehension performance re- 
flects syntactic or surface-level understanding: the person’s competence in under- 
standing the constructs of the modelling formalism, while verification performance 
reflects semantic or deep understanding: the person’s ability to apply that understand- 
ing. 

In defining measures of understanding, we therefore distinguish between two types 
of performance: 

• Comprehension performance: the ability to answer questions about a data model 

• Verification performance: the ability to identify discrepancies between a data 
model and a set of user requirements. 

We also distinguish between two dimensions of performance: 

• Efficiency: the cognitive effort required to understand a model (this requires meas- 
uring task inputs). 

• Effectiveness: how well the data model is understood (this requires measuring task 
outputs) 

In this experiment, four dependent variables were defined, which cover all aspects of 
understanding performance (Table 1). 

• D1: Comprehension Time. This was measured by the time taken to perform the
comprehension task. 

• D2: Comprehension Accuracy. This was measured by the percentage of compre- 
hension questions correctly answered. 

• D3: Verification Time. This was measured by the time taken to complete the veri- 
fication task. 

• D4: Verification Accuracy. This was measured by the number of discrepancies 
correctly identified expressed as a percentage of the total number of discrepancies. 



Table 1. Classification of Dependent Variables

                            Efficiency                Effectiveness
Comprehension Performance   D1: Comprehension Time    D2: Comprehension Accuracy
Verification Performance    D3: Verification Time     D4: Verification Accuracy



2.6 Hypotheses 

The research question defined in Section 1 is broken down into several hypotheses, 
each of which relates to a single dependent variable. This results in four hypotheses.

• H1: Participants using the LDM will perform the comprehension task faster than
those using the standard ER Model (Comprehension Time) 

• H2: Participants using the LDM will perform the comprehension task more accu- 
rately than those using the standard ER Model (Comprehension Accuracy) 

• H3: Participants using the LDM will perform the verification task faster than those 
using the standard ER Model (Verification Time) 

• H4: Participants using the LDM will perform the verification task more accurately 
than those using the standard ER Model (Verification Accuracy) 

The rationale for these hypotheses is that the standard ER Model lacks adequate 
mechanisms for dealing with complexity. This will result in cognitive overload for 
participants in performing the experimental tasks, as the experimental data model 
exceeds working memory capacity by a factor of more than ten. This will reduce their 
comprehension (surface-level understanding) performance and also their verification 
(deep understanding) performance [4, 31]. The LDM representation, because it organ- 
ises the data model into cognitively manageable chunks, reduces cognitive load to 
within the limitations of working memory. This will improve both speed and accuracy 
of performance compared to the standard ER model [50, 51]. 



3 Results 

An independent samples t-test was used to test for differences between groups. 
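
As a minimal sketch of such a test, the t statistic can be recomputed from published summary statistics alone. The example uses the Comprehension Accuracy means and standard deviations from Table 3 and the equal-variance (pooled) form of the test. The 15/14 split of the 29 participants is an assumption made for illustration, since the group sizes are not reported, and the function name is hypothetical.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t for two independent samples, assuming equal variances.

    Returns (t, degrees_of_freedom); t is positive when mean2 > mean1.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / std_err, df


# Comprehension Accuracy figures from Table 3 (percent).
# ASSUMPTION: group sizes of 15 (standard ER) and 14 (LDM); the paper
# states only that there were 29 participants in total.
t, df = pooled_t(69.60, 9.66, 15, 81.43, 10.48, 14)
print(f"t = {t:.2f}, df = {df}")
```

Under these assumptions t is a little above 3 with 27 degrees of freedom, which exceeds the two-tailed critical value of roughly 2.77 at the .01 level and so agrees with the significance reported in Section 3.2.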



3.1 Comprehension Time 

Table 2 summarises the results for Comprehension Time. The difference between 
groups was found to be statistically significant, but in the reverse direction to that 
predicted. Subjects using the standard ER model took significantly less time to complete
the task than subjects using the LDM (α < .05). This means that H1 was not
supported.






Table 2. Comprehension Time Statistics

EXPERIMENTAL GROUP      MEAN (μ)    STDEV (σ)
Standard ER Model       26.67       7.99
Levelled Data Model     34.00       6.02



3.2 Comprehension Accuracy 

Table 3 shows the results for Comprehension Accuracy. Participants using the LDM 
scored 17% better than those using the standard ER model (this represents the unstandardised
effect size). The difference between groups was statistically significant (α <
.01), which strongly confirmed H2.



Table 3. Comprehension Accuracy Statistics

EXPERIMENTAL GROUP      MEAN (μ)    STDEV (σ)
Standard ER Model       69.60%      9.66%
Levelled Data Model     81.43%      10.48%



Comprehension Time Versus Accuracy 

In comparing the results for comprehension time and accuracy, it appears that subjects 
made clear trade-offs between time and accuracy. Subjects using the standard ER 
model performed the comprehension task significantly faster but also less accurately. 
The explanation for this may be that they generated an initial hypothesis, but were 
unable to effectively use the information in the model to refine their hypothesis due to 
the effects of cognitive overload [25]. This would have led to a faster decision, but 
also a less accurate one, as it was primarily based on their initial judgement rather 
than a thorough analysis of problem information. In other words, their inability to find 
relevant (confirming or disconfirming) information increased their speed of response 
but decreased accuracy by a similar margin. Subjects using the LDM representation 
may have spent more time testing and refining hypotheses, which took longer but led 
to a more accurate result. 



3.3 Verification Time 

Table 4 summarises the results for verification time. The mean time taken by each 
group was almost identical, and the difference between groups was not significant at
the α = .05 level. This means that H3 was not supported.



Table 4. Verification Time Statistics

EXPERIMENTAL GROUP      MEAN (μ)    STDEV (σ)
Standard ER Model       31.87       7.58
Levelled Data Model     31.71       7.47



3.4 Verification Accuracy

Table 5 shows the results for verification accuracy. Participants using the LDM 
scored 59% better than those using the standard ER model. The difference between 
groups was statistically significant (α < .01), which means that H4 was strongly supported.



Table 5. Verification Accuracy Statistics

EXPERIMENTAL GROUP      MEAN (μ)    STDEV (σ)
Standard ER Model       37.87%      21.37%
Levelled Data Model     60.06%      21.78%



A point worth noting is the low score for the control group on this task: participants 
using the standard ER model scored less than 40% on this task. In practice, ER mod- 
els are often distributed to end users for the purpose of identifying discrepancies be- 
tween the model and their (tacit) requirements. The inability of users to perform the 
verification task effectively may explain the high level of requirements errors reported 
in practice [29, 33]. 



Verification Time Versus Accuracy 

In the comprehension task, speed and accuracy were inversely related, while on this 
task, both groups took almost exactly the same time. This suggests that the apparent 
tradeoffs between time and accuracy on the comprehension task were due to other 
reasons. Most subjects were able to finish the experiment well within the allotted 
time, so there was little need for subjects to make time-accuracy trade-offs. An alter- 
native explanation is that in the face of overwhelming complexity, subjects in the 
control group resorted to guessing some of their answers in the comprehension task. 
Because the comprehension questions were all in True/False format, participants had 
a 50% chance of getting the answer right by guessing. This may be a particular prob- 
lem with using students as participants, as they are likely to understand this “law of 
probabilities” very well! In the verification task, guessing was taken out of the equa- 
tion, which would explain why both groups took a similar amount of time to perform 
the task. 

If this interpretation is correct, the comprehension results overstate the performance
of both groups, because participants would have scored 50% based on chance
alone. When the scores are adjusted for chance (Table 6), the results obtained are
almost identical to those for Verification Accuracy, which provides support for this
interpretation of the data. Using this measure, participants using the LDM representation
scored 63% better than those using the standard ER model, which is similar to the
difference found on the verification task (59%).
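
The adjustment can be illustrated with a short sketch. The text does not state the formula behind Table 6, so the code below assumes the standard correction for guessing, which rescales a score so that pure guessing maps to 0% and a perfect score to 100%. The function name is hypothetical.

```python
def adjust_for_chance(score_pct, n_options=2):
    """Rescale a percentage score so that guessing maps to 0 and perfect to 100.

    ASSUMPTION: standard correction for guessing; the paper's own
    adjustment method is not given in the text.
    """
    chance = 100.0 / n_options  # 50% for True/False questions
    return (score_pct - chance) / (100.0 - chance) * 100.0


# Mean Comprehension Accuracy scores from Table 3.
er_adjusted = adjust_for_chance(69.60)
ldm_adjusted = adjust_for_chance(81.43)

print(f"Standard ER: {er_adjusted:.2f}%  LDM: {ldm_adjusted:.2f}%")
```

Under this assumption the adjusted means come out near 39% and 63%, close to the Verification Accuracy means of 37.87% and 60.06% in Table 5.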

Table 6. Comprehension Accuracy Adjusted for Chance 




This suggests a systematic flaw in previous experimental studies which have 
evaluated comprehension of data models, almost all of which have used True/False
questions. Data models naturally lend themselves to questions in a True/False format
e.g. Can a project have many employees? Can an employee manage many projects? 
However True/False questions introduce a significant amount of measurement error, 
as subjects can score 50% simply by guessing. A better approach in future may be to 
use multiple choice questions, with an “Unable to Tell” option. 



4 Conclusion 

4.1 Summary of Findings 

Of the four hypotheses originally proposed, two were supported (H2, H4), one was
not supported (H3), with a reverse finding in one case (H1). The conclusions are that:

• H1 (contra): Subjects using the LDM representation took longer to comprehend the
model. 

• H2: Subjects using the LDM representation were able to more accurately compre- 
hend the model. 

• H4: Subjects using the LDM representation were able to more accurately verify the 
model. 



Speed of Understanding 

Neither of the hypotheses relating to speed of understanding (H1, H3) was supported,
and the reverse result was found for H1. The conclusion from this is that reducing
cognitive load has no effect on speed of understanding of data models, as the reverse
result obtained for H1 was attributed to guessing. This is inconsistent with the predictions
of cognitive load theory, but may be explained by the graphical nature of the
subject matter. According to Kim et al [25], distributing information across multiple 
diagrams results in additional cognitive overheads to search for relevant information 
and integrate it together. Thus any increase in speed of understanding as a result of 
the reduction in cognitive load may have been offset by the time taken to navigate 
between different diagrams. 



Accuracy of Understanding 

The significant improvement in both comprehension and verification accuracy using 
the LDM representation provides strong evidence that reducing cognitive load im- 
proves both surface (syntactic) understanding and deep (semantic) understanding of 
data models. By dividing a large ER model into chunks of cognitively manageable 
size, the LDM representation reduces extraneous cognitive load to within the limita- 
tions of working memory, thereby improving understanding. This is consistent with 
the predictions of cognitive load theory. Since the experimental material for the two 
groups was informationally equivalent, the way the models were presented must have 
been responsible for the difference in performance between the experimental groups. 
This suggests that decisions regarding the presentation of conceptual models are far 
from trivial and should be approached with as much care as decisions on their content 
[20, 25, 40].



4.2 Practical Significance

This research has important implications for conceptual modelling practice. The ex- 
perimental results show that reducing cognitive load improves end user comprehen- 
sion and verification of ER models by more than 50%. Improving end user under- 
standing and verification will help to increase the accuracy of conceptual models in 
practice, and thereby reduce incidence of requirements errors and increase the likeli- 
hood that systems will meet user requirements. As the ER model is the most com- 
monly used requirements analysis technique [12], this research has the potential to 
significantly improve requirements analysis effectiveness. 



4.3 Theoretical Significance 

This is the first experimental analysis of the effects of cognitive overload on under- 
standing of conceptual models. The largest model used in any previous experimental 
study of data model understanding is 15 entities, which means the effects of complex- 
ity on end user understanding have been previously unexplored. The results provide 
strong evidence for the existence of cognitive overload effects on end user under- 
standing of ER models, which provides a possible explanation for why they are so 
poorly understood in practice. CLT is used to provide a theoretical explanation for 
why end users have difficulties understanding large ER models and also why analysts 
do not have similar difficulties. The experiment also provides evidence that reducing 
extraneous cognitive load can improve end user understanding of conceptual models. 



4.4 Wider Significance 

Conceptual models are used for understanding a system and communicating this un- 
derstanding to others. For this reason, it is essential that conceptual modelling nota- 
tions take into account the characteristics of human information processing as well as 
the characteristics of the problem situation [7, 20, 25]. One of the most important 
constraints on human information processing is the capacity of working memory: in 
an IS context, the sheer volume of requirements information can quickly overwhelm 
working memory capacity [7, 13]. This is an issue which has been previously unex- 
plored in conceptual modelling research. The findings of this experiment thus have 
general implications for developing more effective conceptual modelling techniques. 
In order to improve understanding, conceptual modelling notations should be de- 
signed to reduce cognitive load to within the limitations of working memory. This is 
particularly important when diagrams are used to communicate with non-technical 
people (as is almost always the case in conceptual modelling), as they are likely to be 
more affected by cognitive overload [49, 51]. 



Abstract. This paper proposes a process for applying design patterns to
the design model of a software system. The process facilitates multi-step
design pattern instantiation involving interaction with a designer. It results
in enrichment of the software design model with explicit and implicit
information about the interplaying entities that form a meaningful design pattern
instance. The pattern template models significant aspects of a design
pattern and provides a set of constraints that make it possible to guide the
designer through the process of instantiation. Emphasis has been put on
supporting an interactive, iterative and incremental instantiation process.
The designer is prompted with a list of tasks based on the pattern template
and the momentary state of each pattern instance. The designer’s actions
consequently alter the state of a particular pattern instance and complete
the closed loop of the process. As a demonstration we have developed a
prototype CASE tool.



1 Introduction and Motivation 

Recent advances in software engineering have made the use of design patterns a
sound engineering practice. Using design patterns is a preferred way of solving
recurring problems among development teams. It helps to avoid “reinventing
the wheel” and eases communication of partial solutions that are specific to
the problem being solved.

Pattern catalogs are collections of generally usable patterns [1,2] or patterns
specific to a certain application domain or development activity [3,4]. Significant
effort is being put into discovering and defining particular design patterns and
describing their various aspects, to capture the essence of a recurring solution so
that existing knowledge can be applied repeatedly. By applying a pattern
to a new recurrence of the problem, or by instantiating it [5], one gets an instance
of the pattern adapted to be the solution of the problem in a particular context.

Many concerns regarding the use of design patterns are being addressed by the
scientific community. They range from identification of design patterns in

No. VG 1/0162/03.

A. Benczur, J. Demetrovics, G. Gottlob (Eds.): ADBIS 2004, LNCS 3255, pp. 144-158, 2004.

© Springer-Verlag Berlin Heidelberg 2004



Template Based, Designer Driven Design Pattern Instantiation Support 145 



legacy software models (e.g. for improving documentation), through advising
developers on choosing the right pattern in the right context, to helping them apply the
pattern correctly and obtain a consistent pattern instance. This is a broad area
of problems demanding different approaches; therefore we decided to address the
last of these tasks, i.e. to focus on supporting the pattern instantiation process by
assisting the developer.

Applying the idea behind a design pattern in a real-world case is not a trivial
task. A design pattern is generally a complex structure consisting of several
entities with different responsibilities, which should be embedded into the software
model. This is done by mapping those responsibilities to actual elements of the
software model. Additionally, existing pattern instances should become a part of
the software model being developed, so that they themselves serve as
documentation.

To relieve the developer of the complexity of creating and maintaining pattern
instances, it is important to provide tool support that enables efficient exploitation
of the knowledge gathered and published in the form of design patterns.

In practice, design problems are addressed in the design phase of the software
development lifecycle, but standard modeling techniques like the Unified Modeling
Language [6] do not provide an efficient way to capture and store information about
solutions based on design patterns. Mostly, design patterns are applied in the design
phase in a way that turns them into specific solutions, but explicit information
about the design pattern instance is missing. Alternatively, solving well
known design problems is postponed to the implementation phase, so that
implementing the software and applying the design pattern take place concurrently, often
by means of assisted refactoring [7]. Unfortunately, such an approach adds little
value except reusing proven implementation solutions. The benefit of improved
communication and documentation does not apply, because the design pattern
instance gets lost in the vast implementation model (in this case, the actual code in
an arbitrary programming language).

Obviously, the sooner the design pattern is applied, the greater the benefit
from the additional information. Such an approach allows better design reuse,
because the same design may be shared by several implementations or subsequent
iterations of the development process. The facts mentioned above lead to the
conclusion that it is beneficial to facilitate design pattern instantiation in
the design phase, and to retain explicit information about existing instances in
the design model. Therefore, in our research we have focused on supporting the
design phase of software development.

The atomic approach, which is suitable for working with primitives like classes,
operations, attributes and relations among them, may not be suitable for
creating more complex structures that represent a solution induced from a design
pattern description, called a design pattern instance. These structures consist
of primitives, but should also carry additional information about how these
primitives play together to form a solution. Moreover, pattern instances are not
isolated in the design model; they usually overlap, and some primitives may
participate in several design pattern instances.

We adopted the three basic properties of a design pattern instantiation support
method as stated in [8] and applied them:

— Interactive. The interactivity involves multiple interactions with designer 
during formation of a pattern instance. Moreover, the process should also be
able to suggest possible steps of instantiation. For example, we found it very
useful to be able to generate the primitives taking part in a selected pattern
instance. Such primitives may subsequently be extended to carry out tasks
specific to the developed system. As the design pattern described in the
discovery phase includes constraints on individual primitives (e.g. “the
operation/method should be abstract”), we can use this information to set those
properties of generated primitives that may be deduced from the design pattern
description.

— Incremental. The pattern may not be instantiated at once. Desired instance 
may be too complex for all facets of the problem to be considered at once.
Moreover, the requirement of an incremental process is implied by the need to
evolve the pattern instance as the designer proceeds with the design work
and/or accepts new requirements.

— Iterative. Most activities of software engineering are iterative, thus it is 
natural to go back to improve design, to fix mistakes or to solve several 
problems simultaneously. 

The rest of this paper is structured as follows. In Section 2, we briefly
introduce several basic concepts that need to be explained before going on.
Section 3 describes the proposed design pattern instance representation.
Section 4 focuses on the main goal of this paper, the mechanisms of
instantiation. In Section 5 we describe a CASE tool prototype we have created to
evaluate our approach. We discuss related work in Section 6 and present
conclusions and future work in Section 7.



2 Basic Concepts 

It is often said that design patterns are discovered rather than invented [3].
Existing solutions of similar problems are abstracted and described. Our
approach takes such a description as a basis for applying the design pattern to
a new recurrence of the problem for which the pattern is an abstracted solution.
Because the pattern is instantiated in a design model, we refer to the person
who instantiates it as a designer.

In this paper we consider work with a UML model. More specifically, the class
model is the target in which the design patterns are instantiated. The
motivation stems from the availability of design pattern descriptions based on
UML/OMT diagrams and from the ubiquity of class diagrams as a foundation of
object-oriented design. Design patterns in catalogs are often described by
defining their static structure in the form of class diagrams.

The following subsection introduces the design pattern representation, which is
crucial for the proposed instantiation process.

2.1 Design Pattern Representation

A suitable formal description of a design pattern, one that captures the most
important aspects and properties of the pattern, is crucial for developing an
instantiation support method. The presented approach proposes the pattern
template as such a description. A pattern must be represented as a pattern
template in order to be instantiated using the suggested approach.

The role modeling approach [9] has been used to realize the mapping of pattern
elements to target model elements. Each responsibility in a design pattern
corresponds to a role. Each role is typed to enforce the type of design model
primitive (class, operation or attribute) which can be cast to that particular
role. Casting is equivalent to creating a mapping between a role and a
particular design model primitive.

The pattern template inherits existing knowledge stored as a description in
pattern catalogs and enriches it with additional information. This information
is added to enable definition of the set of possible outcomes that can be
achieved by instantiating the pattern. It also defines the way in which the
resulting pattern instance integrates into the model of the system being
designed.

A pattern template consists of three basic components: pattern structure, role
graph and compatibility matrix. Each component carries distinct knowledge about
the design pattern, as follows:

— Pattern Structure. It is a description of structural and partly behavioral 
properties of a pattern instance from the pattern catalog. There is a
significant effort to factor out as many metamodel dependencies as possible from
the design pattern representation; therefore almost all the metamodel-specific
properties are concentrated in the pattern structure. To form a structural
model, it is required to identify roles, to remove duplicate primitives
representing the same role and to rename them to carry the name of the role
instead of a concrete actor. The design pattern structure then defines the
properties of actual design primitives cast in roles and the type of relations
between them (e.g. generalization or association in the case of a UML model).

— Role Graph. The role graph defines dependencies between roles, and constrains
the quantity of instances of any role with regard to any other role. This
property is called multiplicity, as defined in [10]. It is the lower and upper
bound on the number of contracts each role may possess. The cited approach has
been extended to allow representation of relative multiplicity. Dependencies in
the role graph are directed edges; each endpoint is characterized by a
multiplicity constraint, similar to the UML relation multiplicity constraint. An
example role graph for the Prototype pattern is shown in Fig. 1.

If there are dependencies in both directions between any two roles, then only
one edge, without arrows, is drawn between them, denoting a bidirectional
dependency.

— Compatibility Matrix. The knowledge of which roles may be combined (in other
words, which roles are compatible) is included neither in the structure of the
pattern template nor in the role graph. The compatibility matrix defines which
pairs of roles are compatible. The notion of compatibility is similar to the
notion of the role constraint matrix described in [9]. If two roles are
compatible, then the same model primitive may be cast into both roles of the
same pattern instance.



Fig. 1. Prototype pattern role graph
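
The role graph and compatibility matrix described above can be pictured as plain data. The following is only a minimal sketch under stated assumptions: the class names `Role`, `Dependency` and `PatternTemplate`, their fields, and the concrete multiplicities and compatible pair used for the Prototype example are hypothetical illustrations and are not taken from Fig. 1 or from the tool. The pattern structure (a class model) is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Role:
    name: str
    primitive_type: str  # "class", "operation" or "attribute"


@dataclass(frozen=True)
class Dependency:
    """Directed role graph edge: role `start` depends on role `end`."""
    start: str
    end: str
    m_min: int = 0                # lower multiplicity bound
    m_max: Optional[int] = None   # upper bound; None stands for "many"
    description: str = ""         # short annotation of the responsibility


@dataclass
class PatternTemplate:
    name: str
    roles: dict                   # role name -> Role
    role_graph: list              # list of Dependency
    compatible_pairs: set = field(default_factory=set)  # set of frozenset pairs

    def are_compatible(self, role_a: str, role_b: str) -> bool:
        # The matrix is symmetric, so an unordered pair is enough.
        return frozenset((role_a, role_b)) in self.compatible_pairs


# Hypothetical values for illustration only.
prototype = PatternTemplate(
    name="Prototype",
    roles={r.name: r for r in (
        Role("Client", "class"),
        Role("Prototype", "class"),
        Role("ConcretePrototype", "class"),
        Role("Prototype.Clone", "operation"),
    )},
    role_graph=[
        Dependency("ConcretePrototype", "Prototype", m_min=1, m_max=None,
                   description="implements interface of cloning"),
    ],
    compatible_pairs={frozenset(("Client", "ConcretePrototype"))},
)
```

Storing compatible roles as unordered pairs keeps the lookup independent of argument order, which matches the symmetric reading of "the two roles are compatible".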



3 Pattern Instance Representation 

Setting up and maintaining information about the existence of a design pattern
instance in the software design model is very important. In order to fulfil the
requirement of interactivity, the pattern instance should represent not only the
grouping of primitives forming it, but should also offer steps which the
designer may wish to take to develop the design further.

The pattern instance is represented by an instance graph. It is defined as an
undirected graph whose nodes are role instances and whose edges are dependency
instances. It defines the momentary state of each pattern instance and evolves
with each step the designer takes. Every instance graph in the software model
represents a single distinct design pattern instance. An example of an instance
graph is shown in Fig. 2. The graphic representation describes a sample instance
of the Decorator pattern.



[Figure 2: instance graph with nodes such as Component.Operation and Decorator.Operation. Legend: cast role, optional vacant role, mandatory vacant role.]

Fig. 2. Instance graph of example Decorator instance

There are three basic states of a node:

— Optional vacant role (e.g. Component.Operation, Decorator.Operation)

— Mandatory vacant role (e.g. ConcreteComponent, ConcreteDecorator) 

— Cast role (e.g. VisualComponent, Decorator) 

Figure 3 shows a statechart depicting transitions between node states. A vacant
role forms an offer to cast the role. Cast roles are created by the designer by
accepting an offer and making a new contract. However, the iterative and
incremental nature of the proposed instantiation process demands the ability to
cancel an existing contract, effectively removing the cast role and updating the
instance graph accordingly. Vacant roles are generated according to rules
derived from the pattern template. These rules are described in detail in the
next section.



[Figure 3: statechart with states Vacant (Mandatory, Optional) and Cast, and transitions Creation, Destruction, Accept offer and Cancel contract.]

Fig. 3. Role instance lifecycle



There are two kinds of offers: optional and mandatory [8]. Mandatory offers
must be turned into cast roles in order for the pattern instance to meet the
multiplicity requirements defined by the role graph.
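
The node lifecycle described above can be sketched as a small state machine. This is an illustrative sketch only: the names `NodeState`, `RoleInstance`, `accept_offer` and `cancel_contract` are hypothetical, and the decision whether a role becomes a mandatory or an optional offer after cancellation is assumed to be supplied by the caller (in the method it would follow from instance graph adjustment).

```python
from enum import Enum


class NodeState(Enum):
    OPTIONAL_VACANT = "optional vacant role"
    MANDATORY_VACANT = "mandatory vacant role"
    CAST = "cast role"


class RoleInstance:
    """One node of the instance graph."""

    def __init__(self, role: str, mandatory: bool) -> None:
        self.role = role
        self.primitive = None  # name of the model primitive once cast
        self.state = self._vacant_state(mandatory)

    @staticmethod
    def _vacant_state(mandatory: bool) -> NodeState:
        return NodeState.MANDATORY_VACANT if mandatory else NodeState.OPTIONAL_VACANT

    def accept_offer(self, primitive: str) -> None:
        # A vacant role is an offer; accepting it makes a contract.
        if self.state is NodeState.CAST:
            raise ValueError(f"role {self.role} is already cast")
        self.primitive = primitive
        self.state = NodeState.CAST

    def cancel_contract(self, mandatory: bool) -> None:
        # Assumption: the node falls back to a vacant state chosen by the caller.
        if self.state is not NodeState.CAST:
            raise ValueError(f"role {self.role} has no contract to cancel")
        self.primitive = None
        self.state = self._vacant_state(mandatory)
```

For instance, `RoleInstance("ConcretePrototype", mandatory=True)` starts as a mandatory offer and becomes a cast role after `accept_offer("Staff")`.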

The edges of the instance graph are dependency instances, and their existence
is crucial for evaluating the actual relative number of contracts.

The pattern instance itself is also characterized by its state, which is
updated after each change in the structure of the representing graph. We have
adopted the following two notions, introduced in [11], to define the state of
the pattern instance:

— Completeness. A complete pattern instance does not contain mandatory vacant
roles. The bounding condition for completeness is as follows: for each cast role
c, for each role graph dependency d ending in role c:

$m_{min}(d) \le |I_d(d)| \le m_{max}(d)$    (1)

where $|I_d(d)|$ is the number of instances, $m_{min}$ is the lower and
$m_{max}$ is the upper bound of the multiplicity of role graph dependency d.

— Correctness. Correctness bounds the number of created instances of each role.
The condition simply disallows creating more contracts than dictated by the
role graph. For each cast role c, for each role graph dependency d ending in
role c:

$|I_d(d)| \le m_{max}(d)$    (2)

We have abandoned the notion of inconsistency described in [11], thanks to the
strong control applied to the process of instantiation, which does not allow
the designer to create instances that do not satisfy the condition of
correctness. Also, each change to the class model results in a change of the
pattern instance graph, if applicable (e.g. deleting a class participating in a
pattern instance results in an instance graph update).
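
Conditions (1) and (2) amount to simple bound checks on counted dependency instances. The sketch below illustrates this under assumptions: the function names, the `counts` mapping keyed by (cast role instance, dependency) and the `bounds` mapping are hypothetical, the dependency names are invented, and `None` is used here to stand for an unbounded ("many") upper multiplicity.

```python
from typing import Optional


def is_correct(count: int, m_max: Optional[int]) -> bool:
    """Condition (2): no more dependency instances than the upper bound."""
    return m_max is None or count <= m_max


def is_complete(count: int, m_min: int, m_max: Optional[int]) -> bool:
    """Condition (1): the count lies between the lower and upper bound."""
    return m_min <= count and is_correct(count, m_max)


def pattern_instance_state(counts: dict, bounds: dict) -> tuple:
    """Evaluate both conditions over all (cast role, dependency) pairs.

    counts: (cast role instance, dependency name) -> number of dependency instances
    bounds: dependency name -> (m_min, m_max)
    Returns (complete, correct).
    """
    correct = all(is_correct(n, bounds[d][1]) for (_, d), n in counts.items())
    complete = all(is_complete(n, *bounds[d]) for (_, d), n in counts.items())
    return complete, correct


# Hypothetical bounds: at least one implementer per prototype, at most one "asks" link.
bounds = {"implements": (1, None), "asks": (0, 1)}
before = pattern_instance_state({("Graphic", "implements"): 0}, bounds)
after = pattern_instance_state({("Graphic", "implements"): 1}, bounds)
```

With these illustrative values the instance is correct but incomplete before any ConcretePrototype is cast, and both complete and correct afterwards.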

4 Pattern Instantiation Process 

The heart of the proposed method is the instantiation process. Its input is a
pattern template. The process results in a pattern instance integrated into the
class model of the software system being developed, and in an instance graph
which maintains relationships between role instances. The process consists of
multiple iterative steps and is driven by interaction with the designer. The
conceptual block scheme in Fig. 4 shows the inputs and results of the
instantiation process.



[Figure 4: block diagram with blocks Role graph, Compatibility matrix, Pattern structure, Instantiation process, and Software model (Class model, Instance graph).]

Fig. 4. Instantiation block diagram



The pattern structure, which is based on the description of the design pattern,
is the basis for how the actual pattern instance will look in the target class
model. It is complemented by the role graph and the compatibility matrix, which
provide the rules and mandatory constraints for controlling the formation of an
instance. Instantiation results in a design pattern instance in its implicit
form (classes with responsibilities, properties and relations based on the
pattern structure) and its explicit form (the instance graph).

4.1 The Designer’s View

Each iteration of the pattern instantiation process (Fig. 5) is bounded by an
action of the designer. Designers are allowed to make new contracts as well as
to cancel existing contracts.

Fig. 5. Iterative instantiation process



The designer is presented with a list of cast and vacant roles, a task list. It
is in fact a different graphical representation of the instance graph. Roles
are displayed as a two-level nested list. At the first level, class role
instances are listed. At the second level, attribute and operation role
instances are listed, grouped by their encapsulating class. Concrete examples of
task lists are shown in Fig. 7.

The rationale behind transforming the graph structure into a task list lies in
the quicker navigation the list provides.

Each edge of the role graph may be annotated with a short description of the
responsibility of the depending roles [10]. E.g. the dependency of
ConcretePrototype on Prototype may be read: “implements interface of cloning”,
and this will be expanded into the hint: “Vacant role ConcretePrototype
implements interface of cloning Graphic playing role of Prototype” in the list
of offers. The hint is placed between the depending role and the name of the
role it depends on, with the names of concrete primitives substituted. The goal
is to clarify the purpose of vacant roles within the pattern instance and so
provide additional guidance for the designer.
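
The expansion of an edge annotation into a task list hint can be sketched as simple string assembly. This is an assumption-laden illustration: `format_hint` and its parameters are hypothetical names, and the wording for an already cast role ("... is playing role of ...") is modeled on the task list entries of Fig. 7 rather than on a documented rule.

```python
from typing import Optional


def format_hint(depending_role: str, description: str,
                target_primitive: str, target_role: str,
                depending_primitive: Optional[str] = None) -> str:
    """Place the edge description between the depending role and its target."""
    if depending_primitive is None:
        subject = f"Vacant role {depending_role}"
    else:
        subject = f"{depending_primitive} is playing role of {depending_role}"
    return f"{subject} {description} {target_primitive} playing role of {target_role}"


hint = format_hint("ConcretePrototype", "implements interface of cloning",
                   "Graphic", "Prototype")
```

Here `hint` reproduces the example sentence quoted in the text for the vacant ConcretePrototype role.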

4.2 Instance Graph Adjustment 

All iterations of the process include adjustment of the instance graph. This
adjustment adds to the graph those possible offers which do not break the rules
of relative multiplicity. Therefore the condition of correctness (2) is checked
each time the pattern instance is about to be updated. Note that the conditions
of correctness and completeness relate to cast roles only. There may be any
number of vacant roles, but casting any one of them must not break the rule of
correctness.

The simplified skeleton of the adjustment algorithm is shown in Fig. 6.

The initial adjustment includes offering vacant instances for all roles. These
role instances form the starting point from which the designer can begin casting
roles.

A weak side of the above-mentioned approach is the way it copes with ambiguity
of role relationships. When there is a large number of possibilities for adding
an instance, the graph adjustment results in creating a vacant role for each
combination of depending roles. An example of such a situation is when there are
roles A, B and C, and B depends on A and C. If the dependency type is
“one-to-one”, the result is the creation of a vacant instance of the B role for
each combination of A and C instances. This is the desired behavior. On the
other hand, if the A to B and C to B relationships are “many”, then only one
vacant role of B is created. This happens because the algorithm cannot determine
which subset of A and C instances the B instance should depend on. A possible
resolution of this caveat is being researched (e.g. by considering a transitive
evaluation of multiplicities that also takes the dependencies of A and C on each
other into account).

repeat
    for all role graph dependency d do
        s ← start point role of d
        e ← end point role of d
        for all s_i ← cast role s do
            if |I_d| ending in s_i > m_max then
                continue
            end if
            for all e_i ← instance of e do
                if adding dependency instance of (s_i, e_i) does not generate
                        duplicate role instance e_i then
                    add dependency instance (s_i, e_i)
                end if
            end for
            if creating of e_i ← vacant role of e and adding dependency instance
                    of (s_i, e_i) does not generate duplicate role instance e_i then
                e_i ← new vacant instance of e
                add dependency instance (s_i, e_i)
            end if
        end for
    end for
until no dependency instances added

Fig. 6. Adjusting of instance graph

4.3 Accepting and Canceling Contracts 

Both accepting and canceling contracts are designer-initiated activities that
change the internal state of an instance. The necessary checking of constraints
is part of them. An existing or newly created primitive may be cast into a role.
The process of casting comprises:

1. Type check. Only primitives of the role type are allowed to be cast.

2. Compatibility matrix check. A cast that would violate a compatibility
matrix constraint is refused.

3. Changing the role instance state. The state of the node is set to cast.
This triggers adjustment of the instance graph for the next iteration.

4. Setting properties on primitive. As the pattern structure of the pattern
template uses the same metamodel as the target model, the properties of the
primitive being cast to the role are copied from the template. This step
includes recursive checking of transitive generalization/realization
relationships in the pattern structure. If there is such a relationship between
two primitives in the pattern template, an identical relationship is enforced
between the instances cast into the respective roles.

A contract may be cancelled only if the role instance does not contain any
encapsulated role instances. Of course, with every role cast, all properties
specific to that role are added to the class design model (e.g. class
stereotype, UML relationships with other primitives cast into depending roles).
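
The first three casting steps and the cancellation rule can be sketched as follows. This is a hedged illustration, not the tool's implementation: `RoleInstance`, `cast` and `cancel` are hypothetical names, step 4 (copying properties and enforcing relationships) is omitted, and "encapsulated role instances" is interpreted here as encapsulated instances that are themselves cast, which is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RoleInstance:
    role: str
    role_type: str                     # "class", "operation" or "attribute"
    primitive: Optional[str] = None    # name of the cast primitive, if any
    encapsulated: list = field(default_factory=list)  # nested role instances

    @property
    def is_cast(self) -> bool:
        return self.primitive is not None


def cast(instance: RoleInstance, primitive: str, primitive_type: str,
         roles_already_played: set, compatible_pairs: set) -> bool:
    """Try to cast `primitive` into `instance`; return True on success."""
    # 1. Type check: only primitives of the role type may be cast.
    if primitive_type != instance.role_type:
        return False
    # 2. Compatibility check against roles the primitive already plays
    #    in the same pattern instance.
    for other_role in roles_already_played:
        if frozenset((instance.role, other_role)) not in compatible_pairs:
            return False
    # 3. State change: the node becomes a cast role.
    instance.primitive = primitive
    return True


def cancel(instance: RoleInstance) -> bool:
    """Cancel a contract unless encapsulated role instances are still cast."""
    if any(child.is_cast for child in instance.encapsulated):
        return False
    instance.primitive = None
    return True
```

Returning a boolean instead of raising keeps the sketch close to the text, where a violating cast is simply refused.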

4.4 Example 

An example of two successive steps of the formation of a Prototype pattern
instance is shown in Fig. 7; the example is taken from [1]. It is a fragment of
the design model of a musical notation editor. The first step shown (marked as
Step n) contains an excerpt from the target class model, which includes only
entities related to this particular pattern instance. It is the situation after
a certain number of steps of the instantiation process.

There are two cast roles. Class GraphicTool is cast in role Client and class
Graphic is cast in role Prototype. The instance graph reflects the actual state
of the instance. There are optional vacant roles for another Client, as well as
for an operation which may be cast to role Client.Operation. Moreover, there are
two mandatory vacant roles: ConcretePrototype and Prototype.Clone. The designer
must cast these two roles in order to make the pattern instance complete (the
grayscale coding is the same as in Fig. 2).

The next step (n+1) shows the situation after the designer decides to cast the
newly created class Staff into the role of ConcretePrototype. As part of the
contract creation process, a generalization relation between the Graphic and
Staff classes is enforced (if such a relation already exists, even a transitive
one, it is taken into account).

As can be seen, another vacant role ConcretePrototype appears, because
according to the multiplicity constraint any number of classes may be cast into
the role of ConcretePrototype (Fig. 1). This time, the vacant role is optional
because there already is a ConcretePrototype cast role depending on Graphic
cast to role Prototype, so the local condition of correctness is satisfied.
However, the clone operations of the Graphic and Staff classes are not yet cast
to their respective roles. Therefore the global condition of completeness is not
yet satisfied. After casting roles Prototype.Clone and ConcretePrototype.Clone,
the pattern instance will be complete, because it will contain cast and optional
vacant roles only.

[Figure 7: class model excerpts, instance graphs and task lists for Step n and Step n+1. Task list entries include “GraphicTool is playing role of Client [asks this prototype to clone itself: Graphic playing role of Prototype]”, “Staff is playing role of ConcretePrototype [implements operation of cloning Graphic playing role of Prototype]”, “Vacant role Prototype.Clone”, “Vacant role Client.Operation” and “Vacant role ConcretePrototype [implements operation of cloning Graphic playing role of Prototype]”.]

Fig. 7. Example of instance evolution

5 Prototype CASE Tool Implementation

To evaluate the proposed approach we have designed and implemented a CASE tool
[14] as an environment enabling work with pattern instances in conjunction with
traditional object-oriented design. In keeping with our declared effort not to
duplicate existing tool support but to reuse and enrich it instead, we decided
to build the tool as an add-in to a renowned CASE tool. We selected Rational
Rose as the host environment.

To make such a tool possible, an external representation of the pattern
template as well as of the pattern instance has been defined. The pattern
template consists of the pattern structure, defined as a Rational Rose class
model, and the role graph and compatibility matrix, supplied as an XML file. We
have represented all 23 patterns described in [1] as pattern templates usable by
the implemented CASE tool. The resulting pattern instances are not only
integrated into the class model; an XML representation of the instance graph for
each pattern instance is also included in the Rose model. This pattern instance
representation is embedded directly into the Rational Rose file to reduce the
possibility of inconsistency between the Rose model and the pattern instance
model. Storing both parts of the model in a single file also makes models easier
to distribute.

The main window of the tool always remains visible during design work as a
slightly sophisticated “toolbar” (Fig. 8). Therefore the designer can
seamlessly switch between working with common design primitives and working
with patterns. A role may be cast by selecting an existing primitive meeting
the constraints imposed by the pattern template, or a new primitive may be
generated and cast into the selected role; the tool thus combines the top-down
and bottom-up approaches [12], but in an incremental way.

Fig. 8. CASE tool user interface



The tool also allows any pattern instance to be shown as a newly created class
diagram that includes all primitives participating in it and the pattern-induced
relations among them.



6 Related Work 

There is quite a high number of commercial and non-commercial CASE
environments that support applying design patterns to a greater or lesser
degree.

Rational XDE [15] is a full-fledged CASE tool focused on design. The designer
is guided through the process of pattern instantiation by a set of wizards
presenting a sequence of predefined steps. The downside is the need to predefine
one or a few possible sequences of instantiation steps for each pattern, and
further modifications of an existing pattern instance are more complicated.
Borland Together [16] is more focused on doing UML design and Java code
generation simultaneously. The pattern instantiation is again controlled by a
wizard.

A similar way of modeling a role system using a graph is described in [17]. In
this case the ambiguity of relations between role instances is resolved by
“slicing” the pattern definition graph into unambiguous subgraphs
interconnected by common nodes representing roles.

One of the most promising tools is FRED (Framework Editor) [18]. Its principle
is similar to that of Borland Together, except that it is less focused on code
generation and concentrates more on design patterns and the architectural view
of software frameworks; it nevertheless contains quite elaborate code
generation. Our research is significantly based on the results that this working
group achieved. In their reports [13,8,10], they formulated the need for an
iterative, incremental and interactive way of supporting software design. The
specialization pattern is modeled using a pattern graph and a set of
constraints. The pattern graph itself defines relations between roles as well as
properties and constraints of the actual model elements cast into the roles (in
this case structural constructs of the Java language). This pattern graph is
then used to induce a casting graph, which describes relations among actual role
instances and is used to generate the framework specialization instructions for
the developer using the product.

We found the above-mentioned idea of describing relations between roles using a
graph, from which a task list is induced to guide the developer, to be a sound
one. In comparison, our aim was to support design pattern instantiation in
software design, thus lifting it a level closer towards abstraction, and to
isolate the core knowledge about a design pattern that makes it possible to
instantiate it. Consequently we extended the concept by consolidating
information specific to a particular metamodel into a pattern structure, and the
independent description of roles and role relations into separate role graph and
compatibility matrix views. Therefore our approach makes it possible to reuse
pattern structure descriptions which are readily available, e.g. in the form of
class models. Additionally, the role graph and compatibility matrix allow
relaxing the structural constraints on pattern instances. All these components
form different views of the same design pattern, considering different aspects
of the instantiation process.

7 Conclusions 

The main purpose of this paper is to present the foundation of a design pattern
support method which has been designed to fill the gaps in design pattern
support in current mainstream CASE tools. Most importantly, we decided to bring
design pattern support to the same level of flexibility that is characteristic
of working with basic design primitives such as classes, operations and
attributes in major CASE tools. Of course, the specific nature of working with
design patterns has been taken into account.

To meet the research goals, development focused on enabling an interactive
pattern instantiation process which generates meaningful offers for the
designer. The purpose is flexible creation of consistent and correct pattern
instances. The instantiation process is neither atomic, nor does it rely on
predefined wizard-like steps. It is designed to make it possible to start with
any part of the design pattern and incrementally evolve the instance as needed,
allowing the designer to focus on matters specific to the designed system and
not on the recurring pattern itself. Another realized feature is the iterative
way of working with patterns, as the process allows modification of already
instantiated parts of a pattern while keeping it consistent.

The process relies on a formalized description of design pattern properties in
the form of a pattern template. This description adds a set of properties needed
to make instantiation possible, but leaves the representation of the pattern
structure separate so that it does not depend on a particular design metamodel
or technique. To represent a pattern instance appropriately, we have also
designed a formalized representation of the pattern instance, which becomes part
of the design model of the developed software system.

Implementing the CASE tool prototype based on intermediate results of the
research gave us valuable feedback for planning further improvements to the
method and for evaluating its current performance. We have used the tool to
document the design model of a mid-sized software system, namely the tool
itself. Using this method, we have interactively instantiated 11 different
patterns in the design model of the tool. Although we used the tool mainly in
the post-design phase (when almost all design work had already been done), and
not concurrently with the design as intended, we appreciated the seamless
integration of work with patterns and work with classes and their properties.

A few shortcomings of the designed method have been identified. They include
the efficient representation of certain combinations of role relationships.
Another potential improvement would be checking role compatibility across all
pattern instances instead of within the scope of every single instance. The
task is to determine a set of properties which would describe how patterns may
be composed to form a larger unit while still checking the compatibility of
responsibilities assigned to shared primitives. Future research will focus on
tackling these issues.

Extending the pattern template with additional views and instantiating such
views is another challenging research direction. Such views could describe
behavioral properties of the pattern in the form of sequence diagrams and
customized sequence diagrams. This behavioral model would be created and evolved
during pattern instantiation just like class diagrams are in the current state.



Abstract. We propose a descriptive high-level language XDTrans devoted to
specifying transformations over XML data. The language is based on the unranked
tree automata approach. In contrast to W3C’s XQuery or XSLT, which require
programming skills, our approach uses high-level abstractions reflecting an
intuitive understanding of the tree-oriented nature of XML data. XDTrans
specifies transformations by means of rules which involve XPath expressions,
node variables and non-terminal symbols denoting fragments of a constructed
result. We propose syntax and semantics for the language as well as algorithms
translating a class of transformations into XSLT.



1 Introduction 

Transformation of XML data becomes increasingly important along with devel- 
opment of web-oriented applications (e.g. Web Services, e-commerce, information 
retrieval, dissemination and integration on the Web), where data structures of 
one application must be transformed into a form accepted by another one. A 
transformation of XML data can be carried out by means of W3C’s languages 
XSLT [1] or XQuery [2]. However, every time when an XML document should 
be transformed into some other form, a new XSLT (or XQuery) program must 
be written, which requires programming skills. Thus, the operational nature of 
XSLT and XQuery makes them less desirable candidates for high-level transfor- 
mation specification [3]. To avoid programming, other transformation languages 
have been proposed [3,4,5]. 

In this paper we propose a new language called XDTrans that is devoted to
specifying transformations for XML data. Some preliminary ideas of the approach
underlying XDTrans were presented in [6] and [7]. We discuss both the syntax and
semantics of the language as well as some representative examples illustrating
its use. We assume that the user perceives XML documents as data trees
according to the DOM model [8], and is familiar with the syntax and semantics of
XPath expressions [9]. The main advantages and novelties of the approach are as
follows:

— XDTrans expressions are high-level transformation rules reflecting intuitive 
understanding how an output data tree is constructed from input data trees, 
and involve XPath expressions, node variables and some non-terminal sym- 
bols (concepts) denoting subtrees in a constructed data tree; 
— expressive power of XDTrans corresponds to structural recursion [10] and a
fragment of top-down XSLT (without functions); however, in XDTrans we
can join an arbitrary number of different documents (not supported by XSLT
[11]);

— a transformation of a single XML document can be translated into XSLT 
program - we propose algorithms for such translations; 

— in contrast to XSLT, which can be used for transforming documents hav- 
ing only a standard form, XDTrans semantic functions could be applied 
to transform non-standard representations of XML documents (e.g. in rela- 
tional databases [6], [12], [13]). 

The structure of the paper is as follows. In Section 2 we propose XDTrans as 
a language for high-level transformation specification. We formulate its syntax 
and semantics, and discuss some examples illustrating its use. In Section 3, 
algorithms translating a class of XDTrans programs into XSLT are proposed. 
Section 4 concludes the paper. 



2 High-Level Transformation Specification — XDTrans 

2.1 Syntax of XDTrans 

According to the W3C standard, any XML document can be represented as a data
tree [8], where a node conforms to one of seven node types: root, element,
attribute, text, namespace, processing instruction, and comment. In this paper,
we restrict our attention to the first four of them. Every node has a unique
node identifier (nid). A data tree can be formalized as follows [7]:

Definition 1. A data tree is an expression defined by the syntax:

    data tree ::= nid(tree),
    tree      ::= e-tree | a-tree | t-tree,
    e-tree    ::= <e, nid>(tree, ..., tree)    (element tree),
    a-tree    ::= <a, nid>(s)                  (attribute tree),
    t-tree    ::= nid(s)                       (text tree),

where nid, e, a, and s are from, respectively, a set 𝒩 of node identifiers, a set
Σ_e of element labels, a set Σ_a of attribute labels, and a set S of string values.
By D_{Σ,S}(𝒩), where Σ = Σ_e ∪ Σ_a, will be denoted the set of all data trees over
Σ and S with node identifiers from 𝒩. □
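
The grammar of Definition 1 can be mirrored by a few record types. The following is only a minimal sketch: the class names, the helper node_ids, and the root identifier 0 are illustrative assumptions, and the remaining identifiers follow the numbering used for the suppliers document in the worked example of Section 2.2.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:                 # t-tree ::= nid(s)
    nid: int
    value: str


@dataclass
class Attribute:            # a-tree ::= <a, nid>(s)
    label: str
    nid: int
    value: str


@dataclass
class Element:              # e-tree ::= <e, nid>(tree, ..., tree)
    label: str
    nid: int
    children: List["Tree"] = field(default_factory=list)


Tree = Union[Element, Attribute, Text]


@dataclass
class DataTree:             # data tree ::= nid(tree)
    nid: int
    root: Tree


def node_ids(tree):
    """Collect node identifiers of a tree in document order."""
    ids = [tree.nid]
    if isinstance(tree, Element):
        for child in tree.children:
            ids.extend(node_ids(child))
    return ids


# A fragment of the suppliers document (root identifier 0 is an assumption).
doc = DataTree(0, Element("suppliers", 1, [
    Element("supplier", 2, [
        Attribute("sname", 3, "s1"),
        Element("part", 4, [Text(5, "p1")]),
    ]),
]))
ids = [doc.nid] + node_ids(doc.root)
```

Every node carries its own identifier, so the uniqueness requirement of Definition 1 amounts to ids containing no duplicates.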

Further on we assume that 𝒞 is a set of non-terminal symbols called concepts,
and 𝒫 is a set of XPath expressions in which some variables can appear.

The goal of transformation is to convert a set of input data trees into an 
expected single output data tree. A transformation can be specified by a set of 
transformation rules (or rules for short). Every rule determines a type of expected 
final or intermediate result data tree in a form of a terminal or non-terminal tree 
expression. 
Definition 2. A tree expression over Σ, 𝒞, S and 𝒫 conforms to the following
syntax:

    τ ::= s | E | a(s) | a(E) | C(E) | e(τ, ..., τ),

where s ∈ S, E ∈ 𝒫, a ∈ Σ_a, C ∈ 𝒞, e ∈ Σ_e. The set of all tree expressions will
be denoted by T_{Σ,S}(𝒞, 𝒫). □



Definition 3. A transformation specification language is a system

    XDTrans = (Σ, 𝒞, S, START, 𝒫, ℛ),

where START ∈ 𝒞 is the initial concept, and ℛ is a finite set of rules of the
following two forms:

    (C, E) → τ,

    (C, ($v1 : E1, ..., $vp : Ep)) → τ,

where C ∈ 𝒞, any E (possibly with subscripts) is from 𝒫, $v (possibly with
subscripts) is a node variable, τ ∈ T_{Σ,S}(𝒞, 𝒫), and every node variable (if any)
occurring in the body occurs also in the head of the rule. □

The head of a rule includes a concept C which will be rewritten by the body
of the rule. A rule with concept C in the head defines this concept. We assume
that any concept in a given set of rules must be defined, and that every concept
has exactly one definition. So, our system is deterministic. Recursive definitions
for concepts are allowed (e.g. SUB_PART in Example 3(1)). There must be exactly
one rule, the initialization rule, defining the initial concept START. In order to
refer to the root of a document we use "/" (if the document is understood) or
a root node identifying the document (e.g. the function document("URI") returning
the root node).

In Fig. 1 there is an example of XML document suppliers, its DTD and data 
tree. The document parts in Fig. 2 is a transformed form of (a selection of) the 
document from Fig. 1 (see Example 1). 

The following example illustrates how XDTrans can be used to transform 
suppliers into parts: 

Example 1. Transform suppliers into parts, where each part has as attributes
the name of the part and the name of its supplier; include only the first two
parts of each supplier.

(1) transformation specification without variables:

    (START, /suppliers) → parts(PART(.))
    (PART, supplier/part[position() <= 2]) → part(PNAME(.), SNAME(.))
    (PNAME, .) → @pname(text())
    (SNAME, .) → ../@sname

(2) transformation specification with variables:

    (START, /suppliers) → parts(PART(.))
    (PART, ($v : supplier,
            $p : $v/part[position() <= 2],
            $s : $v/@sname)) → part(PNAME($p), $s)
    (PNAME, .) → @pname(text())

□
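
To make the intended result of Example 1 concrete, the following sketch computes the same parts document directly with Python's standard xml.etree.ElementTree module. It is not an XDTrans implementation; the function name suppliers_to_parts and the limit parameter are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

SUPPLIERS = """<suppliers>
  <supplier sname="s1"><part>p1</part><part>p2</part><part>p3</part></supplier>
  <supplier sname="s2"><part>p1</part></supplier>
</suppliers>"""


def suppliers_to_parts(xml_text, limit=2):
    """Build <parts> with one <part pname=... sname=...> per selected part."""
    src = ET.fromstring(xml_text)
    out = ET.Element("parts")
    for supplier in src.findall("supplier"):
        # Keep only the first `limit` parts of each supplier.
        for part in supplier.findall("part")[:limit]:
            ET.SubElement(out, "part",
                          pname=part.text, sname=supplier.get("sname"))
    return out


parts = suppliers_to_parts(SUPPLIERS)
pairs = [(p.get("pname"), p.get("sname")) for p in parts]
```

The list pairs holds one (pname, sname) pair per generated part element, in document order, which corresponds to the content of the parts document in Fig. 2.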
<suppliers>
  <supplier sname="s1">
    <part>p1</part>
    <part>p2</part>
    <part>p3</part>
  </supplier>
  <supplier sname="s2">
    <part>p1</part>
  </supplier>
</suppliers>

<!DOCTYPE suppliers [
  <!ELEMENT suppliers (supplier*)>
  <!ELEMENT supplier (part+)>
  <!ATTLIST supplier
      sname CDATA #REQUIRED>
  <!ELEMENT part (#PCDATA)> ]>

(Data tree diagram of the suppliers document not reproduced.)

Fig. 1. XML document suppliers, its DTD and data tree



<parts>
  <part pname="p1" sname="s1"/>
  <part pname="p2" sname="s1"/>
  <part pname="p1" sname="s2"/>
</parts>

<!DOCTYPE parts [
  <!ELEMENT parts (part*)>
  <!ELEMENT part EMPTY>
  <!ATTLIST part
      pname CDATA #REQUIRED
      sname CDATA #REQUIRED>]>

(Data tree diagram of the parts document not reproduced.)

Fig. 2. XML document parts, its DTD and data tree



2.2 Semantics for XDTrans

Every XDTrans rule is evaluated in an evaluation context (or context for short),
as XPath expressions are [9], [14]. We will write (C, E)(S, n) to denote that
the head of a rule is evaluated in a context (S, n), where S is an ordered set of
distinct nodes (a context set), and n is a context node, n ∈ S (() denotes the
empty context). A context is used to evaluate the XPath expression(s) included in
the head of the rule. Evaluation of the head produces a new context set (the
output context set). Every output context determined by the output context set
is then used to process the body of the rule, where recursively the same or other
rules can be invoked.

A rule is formulated against an input data tree(s) and the result of the rule is 
an output data tree (possibly non-terminal). An XPath expression (or a sequence 
of XPath expressions labeled with variables) occurring in the head of the rule, 
determines a number of (sub)trees instantiated from the tree expression being
the body of the rule. All trees developed from the tree expression have a common
structure but each of them has unique nodes (node identifiers). If no leaf of
such a tree contains a concept, then the tree is a terminal tree; otherwise
it is a non-terminal tree.

Now we define how a rule transforms input data tree(s) into an output data
tree. By N_E(S, n) we denote an ordered set of nodes obtained by evaluating
E in a context (S, n). For example, for the data tree from Fig. 1, we have:
N_{/suppliers}() = (1) and N_{supplier/part[position() <= 2]}((1), 1) = (4, 6, 12).

We will use a Skolem function new() that for every invocation returns a
unique node identifier. Now, we will define a semantics for heads of XDTrans
rules and next a semantics for bodies of the rules.

Evaluation of (C, E) → τ in the context () and (S, n)

1. First, the head of the rule must be evaluated in a given context:

   — the result of the evaluation is an ordered set S1 = N_E(S, n) =
     (n1, ..., nm), m ≥ 1, referred to as an output context set,

   — for any output context (S1, ni), 1 ≤ i ≤ m, the body τ of the rule must
     be processed;

   — it follows from the restrictions on XML documents that for an
     initialization rule its head qualification E must return a singleton, i.e.
     S1 = N_E() = (n1), i.e. m = 1.

2. In the next step, τ should be evaluated in m contexts, i.e. the expression
   τ((S1, n1), ..., (S1, nm)), m ≥ 1, must be processed:

   H1. For an initialization rule we start from the expression (START, E)() and
       rewrite it by the expression new()(τ((n1), n1)), where (n1) is the result
       of evaluating E in the empty context, Fig. 3.



Fig. 3. Rewriting specified by a transformation rule (START, E) → τ in the empty
context.

   H2. For a non-initialization rule (C, E) → τ and a context (S, n), where
       N_E(S, n) = S1 = (n1, ..., nm), the expression x((C, E)(S, n)) is rewritten
       by x(τ(S1, n1), ..., τ(S1, nm)), Fig. 4.

Fig. 4. Rewriting specified by a rule (C, E) → τ in a context (S, n).
Evaluation of (C, ($v1 : E1, ..., $vp : Ep)) → τ in the context () and (S, n)

1. The result of processing an expression ($v1 : E1, ..., $vp : Ep) in a context
   (S, n) (or in ()) is an ordered set Ω = (ω1, ..., ωm), m ≥ 1, of distinct
   valuations of variables. Every valuation ω ∈ Ω assigns a node of the input data
   tree to $vi such that ω($vi) ∈ N_{Ei}(S, n) (or ω($vi) ∈ N_{Ei}()), for every i,
   1 ≤ i ≤ p. For an initialization rule, Ω must have at most one element.

   H3. An initialization rule (START, ($v1 : E1, ..., $vp : Ep))() is rewritten by
       new()(τ((ω1), ω1)), where ω1 is the single valuation satisfying the
       expression ($v1 : E1, ..., $vp : Ep) in the empty context, Fig. 5.

Fig. 5. Rewriting specified by a rule (START, ($v1 : E1, ..., $vp : Ep)) → τ in the
empty context.

   H4. For a non-initialization rule (C, ($v1 : E1, ..., $vp : Ep)) → τ and a
       context (S, n), where Ω = (ω1, ..., ωm), m ≥ 1, the expression
       x((C, ($v1 : E1, ..., $vp : Ep))(S, n)) is converted into
       x(τ(Ω, ω1), ..., τ(Ω, ωm)), Fig. 6.

Fig. 6. Rewriting specified by a rule (C, ($v1 : E1, ..., $vp : Ep)) → τ in a
context (S, n).

2. Let E[$v](Ω, ω') denote that an expression E with the only variable $v is to
   be evaluated in a context (Ω, ω') (see e.g. Fig. 5 and Fig. 6). In this case
   the context is given by means of valuations, so it should be resolved first.
   The resolution is achieved by replacing every valuation with its value for
   the variable $v. Thus the resolved version of the expression is E[$v](S, n),
   where S = {ω($v) | ω ∈ Ω}, and n = ω'($v).
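
The resolution step can be sketched in a few lines. The sketch below assumes that a valuation is represented as a dictionary from variable names to node identifiers; the function name resolve is hypothetical, and the three valuations are the ones that arise for Transformation (2) of Example 1.

```python
def resolve(valuations, current, var):
    """Turn a valuation context (Omega, omega') into a node context (S, n).

    S keeps the values of `var` in the order of the valuations, without
    duplicates, since a context set is an ordered set of distinct nodes.
    """
    context_set = []
    for omega in valuations:
        node = omega[var]
        if node not in context_set:
            context_set.append(node)
    return context_set, current[var]


omega1 = {"$v": 2, "$p": 4, "$s": 3}
omega2 = {"$v": 2, "$p": 6, "$s": 3}
omega3 = {"$v": 10, "$p": 12, "$s": 11}
big_omega = [omega1, omega2, omega3]

ctx_p = resolve(big_omega, omega1, "$p")   # context for an expression over $p
ctx_s = resolve(big_omega, omega1, "$s")   # context for an expression over $s
```

For $p all three valuations yield distinct nodes, whereas for $s the first two valuations agree, so the resolved context set is shorter than Ω.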



Evaluation of the body of a rule 

The meaning of transformations specified by expressions of the form τ(S, n) is
illustrated in Fig. 7. If the result does not depend on the whole context, we use
the notation (., .).

Fig. 7. Graphical interpretation of rewritings specified by bodies of rules

For all possible tree expressions forming the body of a rule (according to
Definition 2), we have:



B1. The body of the form s creates a text node with the string value s. Rewriting
    imposed by the expression is independent of a context, Fig. 7(1):

        x(s(S, n)) ⇒ x(new()(s)).

B2. For an expression of the form E in a context (S, n), where N_E(S, n) =
    (n1, ..., nm), subtrees from the input tree(s) identified by ni, 1 ≤ i ≤ m,
    are copied into the output tree. The expression copy(ni) recursively copies a
    source subtree denoted by ni and creates a new node identifier for any copy
    of a source node, Fig. 7(2):

        x(E(S, n)) ⇒ x(copy(n1), ..., copy(nm)).

B3. An expression of the form a(s) creates an attribute node labeled a with
    the string value s. Rewriting defined by the expression is independent of a
    context, Fig. 7(3):

        x(a(s)(S, n)) ⇒ x(<a, new()>(s)).

B4. For an expression of the form a(E) in a context (S, n), where N_E(S, n) =
    (n1), an attribute node labeled a with the string value equal to value(n1) is
    created, Fig. 7(4):

        x(a(E)(S, n)) ⇒ x(<a, new()>(value(n1))).

B5. If S1 = N_E(S, n) = (n1, ..., nm), then an expression x(C(E)(S, n)) is replaced
    by the expression x(C(S1, n1), ..., C(S1, nm)), where C(S1, ni) denotes an
    invocation of the rule identified by C in a context (S1, ni), 1 ≤ i ≤ m,
    Fig. 7(5):

        x(C(E)(S, n)) ⇒ x(C(S1, n1), ..., C(S1, nm)).

B6. For an expression e(τ1, ..., τq) in a context (S, n) a new element node
    labeled e is created with q subtrees, where the i-th subtree will be developed
    by evaluating the expression τi in the context (S, n), 1 ≤ i ≤ q, Fig. 7(6):

        x(e(τ1, ..., τq)(S, n)) ⇒ x(<e, new()>(τ1(S, n), ..., τq(S, n))).
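
The interplay of new() and copy() in rule B2 can be illustrated as follows. This is a minimal sketch under stated assumptions: a tree is encoded as a tuple (label, nid, content), text nodes use the made-up label "#text", and the starting identifier 50 is arbitrary.

```python
import itertools


def make_new(start=50):
    """Return a Skolem-like function yielding a fresh identifier per call."""
    counter = itertools.count(start)
    return lambda: next(counter)


def copy(tree, new):
    """Recursively copy a subtree, giving every copied node a new identifier."""
    label, _old_nid, content = tree
    if isinstance(content, list):          # element node with child subtrees
        return (label, new(), [copy(child, new) for child in content])
    return (label, new(), content)         # attribute or text node


source = ("part", 4, [("#text", 5, "p1")])
duplicate = copy(source, make_new())
```

The copy preserves labels, string values and structure, while none of the source identifiers (4 and 5) reappears in the result.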
Transformations from Example 1 are carried out as follows:

Transformation (1)

1. Evaluation of the initialization rule in the empty context:

   — Rule: (START, /suppliers) → parts(PART(.))
   — Value of the head qualifier in the evaluation context: N_{/suppliers}() = (1)
   — Rewritings:

       START() ⇒ 50(parts(PART(.))((1), 1))
       parts(PART(.))((1), 1) ⇒ <parts, 51>(PART(.)((1), 1))
       PART(.)((1), 1) ⇒ PART((1), 1)

2. Evaluation of the rule defining PART in context ((1), 1):

   — Rule:
       (PART, supplier/part[position() <= 2]) → part(PNAME(.), SNAME(.))
   — Value of the head qualifier in the evaluation context:
       N_{supplier/part[position() <= 2]}((1), 1) = (4, 6, 12)
   — Rewritings:

       PART((1), 1) ⇒ (part(PNAME(.), SNAME(.))((4, 6, 12), 4),
                        part(PNAME(.), SNAME(.))((4, 6, 12), 6),
                        part(PNAME(.), SNAME(.))((4, 6, 12), 12))
       part(PNAME(.), SNAME(.))((4, 6, 12), 4) ⇒
           ⇒ <part, 52>(PNAME(.)((4, 6, 12), 4), SNAME(.)((4, 6, 12), 4))
       PNAME(.)((4, 6, 12), 4) ⇒ PNAME((4), 4), since N_.((4, 6, 12), 4) = (4)
       SNAME(.)((4, 6, 12), 4) ⇒ SNAME((4), 4)

3. Evaluation of the rule defining PNAME in context ((4), 4):

   — Rule: (PNAME, .) → @pname(text())
   — Value of the head qualifier in the evaluation context: N_.((4), 4) = (4)
   — Rewritings:

       PNAME((4), 4) ⇒ @pname(text())((4), 4)
       @pname(text())((4), 4) ⇒ <@pname, 53>(value(5)),
           since N_{text()}((4), 4) = (5)

4. Evaluation of the rule defining SNAME in context ((4), 4):

   — Rule: (SNAME, .) → ../@sname
   — Value of the head qualifier in the evaluation context: N_.((4), 4) = (4)
   — Rewritings:

       SNAME((4), 4) ⇒ ../@sname((4), 4)
       ../@sname((4), 4) ⇒ copy(3), since N_{../@sname}((4), 4) = (3)

5. Analogously for the remaining cases.
Transformation (2)

1. Evaluation of the initialization rule in the empty context: as for
   transformation (1).

2. Evaluation of the rule defining PART in context ((1), 1):

   — Rule:
       (PART, ($v : supplier,
               $p : $v/part[position() <= 2],
               $s : $v/@sname)) → part(PNAME($p), $s)
   — Value of the head qualifier in the evaluation context:
     a sequence Ω = (ω1, ω2, ω3) of three valuations defined as follows:

       ω1 : (ω1($v) = 2, ω1($p) = 4, ω1($s) = 3),
       ω2 : (ω2($v) = 2, ω2($p) = 6, ω2($s) = 3),
       ω3 : (ω3($v) = 10, ω3($p) = 12, ω3($s) = 11).

   — Rewritings:

       PART((1), 1) ⇒ (part(PNAME($p), $s)(Ω, ω1),
                        part(PNAME($p), $s)(Ω, ω2),
                        part(PNAME($p), $s)(Ω, ω3))
       part(PNAME($p), $s)(Ω, ω1) ⇒
           ⇒ <part, 52>(PNAME($p)(Ω, ω1), $s(Ω, ω1))
       PNAME($p)(Ω, ω1) ⇒ PNAME((4), 4), since N_{$p}(Ω, ω1) = (ω1($p)) = (4)
       $s(Ω, ω1) ⇒ copy(3), since N_{$s}(Ω, ω1) = (ω1($s)) = (3)

3. Evaluation of the rule PNAME in the context ((4), 4): in the same way as for
   transformation (1).

4. Similarly for the remaining cases.

2.3 Examples 

The transformation discussed in Example 1 can be formulated without using 
variables (specification 1) or with variables (specification 2). Now we will show 
examples of transformations, where it is necessary to use variables. In general, 
variables are necessary when the transformation requires a join condition com- 
paring two or more XML values belonging to the same or to two different doc- 
uments. 

Example 2. (Join) Join books and papers documents into a bib document.

Input "papers.xml":
    <papers>...</papers>

Input "books.xml":
    <books>...</books>

Output:
    <bib>
      <papers>...</papers>
      <books>...</books>
    </bib>

Transformation specification:

    (START, ($p : document("papers.xml")/papers,
             $b : document("books.xml")/books)) → bib($p, $b)
Example 3. (Recursion) Convert a flat list of part elements, partlist, into a tree
representation, parttree, based on partof attributes [15], and do the inverse
transformation.

Input DTD:

<!DOCTYPE partlist [
  <!ELEMENT partlist (part*)>
  <!ELEMENT part EMPTY>
  <!ATTLIST part
      partid CDATA #REQUIRED
      partof CDATA #IMPLIED
      name CDATA #REQUIRED>]>

(1) Transformation specification:

    (START, /partlist) → parttree(MAIN_PART(.))
    (MAIN_PART, part[not(@partof)]) → part(@partid, @name, SUB_PART(.))
    (SUB_PART, ($v1 : @partid,
                $v2 : ../part[@partof=$v1])) → part($v2/@partid, $v2/@name,
                                                    SUB_PART($v2))

(2) Inverse transformation specification:

    (START, /parttree) → partlist(MAIN_PART(.), SUB_PART(.))
    (MAIN_PART, part) → part(@id, @name)
    (SUB_PART, part//part) → part(@id, @name, @partof(../@id))

□
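
The recursion of specification (1) in Example 3 can be paraphrased in ordinary code. The sketch below is not generated from the XDTrans rules; it only mimics them with Python's xml.etree.ElementTree, and the sample partlist document as well as the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input conforming to the partlist DTD.
PARTLIST = """<partlist>
  <part partid="1" name="car"/>
  <part partid="2" partof="1" name="engine"/>
  <part partid="3" partof="2" name="piston"/>
  <part partid="4" partof="1" name="wheel"/>
</partlist>"""


def build_subparts(flat_parts, parent_out, parent_id):
    """Attach to parent_out every part whose partof equals parent_id.

    With parent_id=None this plays the role of MAIN_PART (parts without a
    partof attribute); the recursive call plays the role of SUB_PART.
    """
    for part in flat_parts:
        if part.get("partof") == parent_id:
            node = ET.SubElement(parent_out, "part",
                                 partid=part.get("partid"),
                                 name=part.get("name"))
            build_subparts(flat_parts, node, part.get("partid"))


def partlist_to_parttree(xml_text):
    flat_parts = ET.fromstring(xml_text).findall("part")
    tree = ET.Element("parttree")
    build_subparts(flat_parts, tree, None)
    return tree


def shape(element):
    """Summarize a part subtree as (name, [child shapes])."""
    return (element.get("name"), [shape(child) for child in element])


tree = partlist_to_parttree(PARTLIST)
result = [shape(child) for child in tree]
```

As in the SUB_PART rule, each recursive step selects the parts whose partof value equals the partid of the part just emitted, so the nesting depth of the output follows the partof chains of the input.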

To enrich the expressive power of transformations, we will use filtering of
a set of contexts. We explain it by the following example.

Example 4. (Context filtering) Let us assume that we want to set a filter on
contexts using the position of a context node. For example, the transformation
(1) defined in Example 1 would have the form:

    (START, /suppliers) → parts(PART(.))
    (PART, supplier/part[position() <= 2]) → part(PNAME(.[$currpos != 2]))
    (PNAME, .) → @pname(text())

Note that in the PART rule we want to ignore the second context which is passed
to the right-hand side of the rule. As we have seen, in our example we have
three contexts determined by the context set (4, 6, 12), so the evaluation in the
context ((4, 6, 12), 6) must be ignored. □

We assume that an XPath expression can use the following context variables:

— $curr, denoting the current context node (the same as the dot (.));

— $currpos, denoting the position of the context node in a context set;

— $size, denoting the size (number of elements) of a context set.
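
The effect of the three context variables can be sketched as follows, assuming that a context set is given as an ordered list of node identifiers; the generator name contexts is hypothetical.

```python
def contexts(context_set):
    """Yield ($curr, $currpos, $size) for each node of an ordered context set."""
    size = len(context_set)
    for currpos, curr in enumerate(context_set, start=1):
        yield curr, currpos, size


# Filter of Example 4: skip the context whose position is 2.
kept = [curr for curr, currpos, size in contexts([4, 6, 12]) if currpos != 2]
```

For the context set (4, 6, 12), the filter $currpos != 2 leaves the context nodes 4 and 12, so the evaluation in the context ((4, 6, 12), 6) is skipped.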



Output DTD (Example 3):

<!DOCTYPE parttree [
  <!ELEMENT parttree (part*)>
  <!ELEMENT part (part*)>
  <!ATTLIST part
      partid CDATA #REQUIRED
      name CDATA #REQUIRED>]>
3 Translation from XDTrans to XSLT

The following two algorithms transform a specification made in XDTrans into an
XSLT program which carries out the transformation on document instances. In
general, there are many possible XSLT programs that can perform a given
transformation. This method can be applied only to transformations over one
document because joining of many documents is not supported by XSLT.

Algorithm 1 defines the translation for the head of a rule and uses Algorithm 2
to translate the body of a rule. As the result, an XSLT program (stylesheet) is
obtained.

Algorithm 1 (generating XSLT templates).

Input: a specification rule r ∈ ℛ
Output: XSLT template A(r)

A(r) = case r of

  (START, E) → τ :
    <xsl:template name="START" match="/">
      <xsl:for-each select="E">
        p(τ)
      </xsl:for-each>
    </xsl:template>

  (C, E) → τ :
    <xsl:template name="C">
      <xsl:param name="curr"> 1 </xsl:param>
      <xsl:param name="currpos"> 2 </xsl:param>
      <xsl:param name="size"> 3 </xsl:param>
      <xsl:for-each select="$curr/E">
        p(τ)
      </xsl:for-each>
    </xsl:template>

  (C, ($v1 : E1, ..., $vn : En)) → τ :
    <xsl:template name="C">
      <xsl:param name="curr"> 1 </xsl:param>
      <xsl:param name="currpos"> 2 </xsl:param>
      <xsl:param name="size"> 3 </xsl:param>
      <xsl:for-each select="$curr/E1">
        <xsl:variable name="v1" select="."/>
        ...
        <xsl:for-each select="$curr/En">
          <xsl:variable name="vn" select="."/>
          p(τ)
        </xsl:for-each>
        ...
      </xsl:for-each>
    </xsl:template>

endcase
Algorithm 2 (generating XSLT elements).

Input: a non-terminal tree expression τ ∈ T_{Σ,S}(𝒞, 𝒫)
Output: XSLT element p(τ)

p(τ) = case τ of

  s :
    <xsl:text>
      s
    </xsl:text>

  E :
    <xsl:for-each select="E">
      <xsl:copy-of select="."/>
    </xsl:for-each>

  a(s) :
    <xsl:attribute name="a">
      s
    </xsl:attribute>

  a(E) :
    <xsl:attribute name="a">
      <xsl:value-of select="E"/>
    </xsl:attribute>

  C(E) :
    <xsl:for-each select="E">
      <xsl:call-template name="C">
        <xsl:with-param name="curr" select="."/>
        <xsl:with-param name="currpos" select="position()"/>
        <xsl:with-param name="size" select="last()"/>
      </xsl:call-template>
    </xsl:for-each>

  e(τ1, ..., τn) :
    <xsl:element name="e">
      p(τ1) ... p(τn)
    </xsl:element>

endcase
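
The case analysis of Algorithm 2 maps naturally onto a recursive function. The following is a sketch only: tree expressions are encoded as tagged tuples, and both this encoding and the function name rho are assumptions introduced here, not part of XDTrans.

```python
from xml.sax.saxutils import escape, quoteattr


def rho(tau):
    """Translate a tree expression into an XSLT fragment (as a string).

    Assumed encoding: ("text", s), ("path", E), ("attr_text", a, s),
    ("attr_path", a, E), ("concept", C, E), ("element", e, [tau1, ...]).
    """
    kind = tau[0]
    if kind == "text":
        return f"<xsl:text>{escape(tau[1])}</xsl:text>"
    if kind == "path":
        return (f"<xsl:for-each select={quoteattr(tau[1])}>"
                '<xsl:copy-of select="."/></xsl:for-each>')
    if kind == "attr_text":
        return (f"<xsl:attribute name={quoteattr(tau[1])}>"
                f"{escape(tau[2])}</xsl:attribute>")
    if kind == "attr_path":
        return (f"<xsl:attribute name={quoteattr(tau[1])}>"
                f"<xsl:value-of select={quoteattr(tau[2])}/></xsl:attribute>")
    if kind == "concept":
        return (f"<xsl:for-each select={quoteattr(tau[2])}>"
                f"<xsl:call-template name={quoteattr(tau[1])}>"
                '<xsl:with-param name="curr" select="."/>'
                '<xsl:with-param name="currpos" select="position()"/>'
                '<xsl:with-param name="size" select="last()"/>'
                "</xsl:call-template></xsl:for-each>")
    if kind == "element":
        body = "".join(rho(child) for child in tau[2])
        return f"<xsl:element name={quoteattr(tau[1])}>{body}</xsl:element>"
    raise ValueError(f"unknown tree expression kind: {kind!r}")


# Body of the initialization rule of Example 3(1): parttree(MAIN_PART(.))
xslt = rho(("element", "parttree", [("concept", "MAIN_PART", ".")]))
```

The resulting fragment is an xsl:element named parttree that wraps a single xsl:for-each calling the MAIN_PART template. Whitespace and indentation are omitted, so the string would still need pretty-printing to resemble a hand-written stylesheet.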



Example 5. Application of Algorithm 1 (and Algorithm 2) to the transformation 
specification from Example 3(1), produces the XSLT script listed in Fig. 8. 



4 Conclusions 

In the paper we proposed a method and a high-level language, XDTrans, devoted
to high-level specification of XML data transformations. The language is both
descriptive and expressive, and is based on ideas rooted in tree automata [16].
From the user's point of view, the specification can be perceived as a refinement
process in which properties of the constructed output document are systematically
specified and refined. The method supports the user's intuition for defining trees
according to the top-down way of thinking. A program in XDTrans consists of a
set of rules (possibly recursive). Each rule defines a concept (i.e. a non-terminal
symbol) included in the head of the rule, which is either an initial concept or
occurs in leaves of the current state of a constructed tree. The rule specifies how
the concept should be replaced. Finally, a terminal tree (without occurrences of
any concept) should be obtained.
<xsl:template name="START" match="/">
  <xsl:for-each select="/partlist">
    <xsl:element name="parttree">
      <xsl:for-each select=".">
        <xsl:call-template name="MAIN_PART">
          <xsl:with-param name="curr" select="."/>
        </xsl:call-template>
      </xsl:for-each>
    </xsl:element>
  </xsl:for-each>
</xsl:template>

<xsl:template name="MAIN_PART">
  <xsl:param name="curr">1</xsl:param>
  <xsl:for-each select="$curr/part[not(@partof)]">
    <xsl:element name="part">
      <xsl:for-each select="@partid">
        <xsl:copy-of select="."/>
      </xsl:for-each>
      <xsl:for-each select="@name">
        <xsl:copy-of select="."/>
      </xsl:for-each>
      <xsl:for-each select=".">
        <xsl:call-template name="SUB_PART">
          <xsl:with-param name="curr" select="."/>
        </xsl:call-template>
      </xsl:for-each>
    </xsl:element>
  </xsl:for-each>
</xsl:template>

<xsl:template name="SUB_PART">
  <xsl:param name="curr">1</xsl:param>
  <xsl:for-each select="$curr/@partid">
    <xsl:variable name="v1" select="."/>
    <xsl:for-each select="$curr/../part[@partof=$v1]">
      <xsl:variable name="v2" select="."/>
      <xsl:element name="part">
        <xsl:for-each select="$v2/@partid">
          <xsl:copy-of select="."/>
        </xsl:for-each>
        <xsl:for-each select="$v2/@name">
          <xsl:copy-of select="."/>
        </xsl:for-each>
        <xsl:for-each select="$v2">
          <xsl:call-template name="SUB_PART">
            <xsl:with-param name="curr" select="."/>
          </xsl:call-template>
        </xsl:for-each>
      </xsl:element>
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>

Fig. 8. XSLT program generated for transformation specification in Example 3(1)

The semantics defined for XDTrans rules provides a way to implement the
language in any repository storing XML data. An algorithm doing this task must be
able to process XPath expressions in such repositories. In particular, in [6] and
[13] we discussed this problem when the repository is a relational database
system. In this paper we show that a broad class of XDTrans specifications, except
for those which join different documents, can be translated into XSLT programs
which carry out the expected transformation over the standard representation of
XML data. Since our specification is independent of the way in which XML data
are represented, it could also be used to integrate heterogeneous data by
specifying appropriate transformations over them.
Abstract. In today's global environment, the structure and presentation of
information may depend on the underlying context of the user. To address this
issue, in previous work we have proposed multidimensional semistructured data
(MSSD), where an information entity can have alternative variants, or facets, 
each holding under some world, and MOEM, a data model suitable for repre- 
senting MSSD. In this paper we briefly present MQL, a query language for 
MSSD that supports context-driven queries, and we attempt to motivate the di- 
rect use of context in data models and query languages by comparing MOEM 
and MQL with equivalent, context-unaware forms of representing and querying 
information. Specifically, we implemented an evaluation process for MQL dur- 
ing which MQL queries are translated to equivalent Lorel queries, and MOEM 
databases are transformed to corresponding OEM databases. The comparison 
between the two query languages and data models demonstrates the benefits of 
treating context as first-class citizen. We illustrate this query translation process 
using a cross-world MQL query, which has no direct counterpart in context- 
unaware query languages and data models. 



1 Introduction 

The Web posed a number of new problems to the management of data, and the need 
for metadata at a semantic level was soon realized [9]. One such problem is that, while 
in traditional databases and information systems the number of users is more or less 
known and their background is to a great extent homogeneous, Web users do not 
share the same background and do not apply the same conventions when interpreting 
data. Such users can have different perspectives of the same entities, a situation that 
should be taken into account by Web data models and query languages. A related 
issue is that information providers often need to manage different variations of essen- 
tially the same information, which are targeted to different consumer groups. 

Those problems call for a way to represent and query information entities that 
manifest different facets, whose contents can vary in structure and value. As a simple 
example imagine a product (car, laptop computer, etc.) whose specification changes 
according to the country it is being exported. Or a Web page that is to be displayed on 
devices with different capabilities, like mobile phones, PDAs, and personal com- 
puters. Another example is a report that must be represented at various degrees of 
detail and in various languages. 
In previous work we proposed multidimensional semistructured data (MSSD)
[2,14,16], which are semistructured data [3] that present different facets under different
contexts. We argued [4] that Web data should be able to adapt to different contexts, 
and that this capability should be managed in a uniform way at the level of database 
and information systems. Context-aware data models and query languages can be 
applied on a variety of cases and domains; in [6,7,15] we showed how they can be 
used to represent and query histories of semistructured databases that evolve over 
time. 

Context has been used in diverse areas of computer science as a tool for reasoning 
with viewpoints and background beliefs, and as an abstraction mechanism for dealing 
with complexity, heterogeneity, and partial knowledge. A formal framework for rea- 
soning upon a subset of a global knowledge base can be found in [17], while examples 
of how context can be used for partitioning an information base into manageable 
fragments of related objects can be found in [8,13]. Our perception of context has also 
been used in OMSwe, a web-publishing platform described in [1,12]. OMSwe is based 
on an Object DBMS, which has been extended to support a flexible, domain- 
independent model for information delivery where context plays a pivotal role. 

In this paper, we give a short overview of Multidimensional Query Language 
(MQL) [4,7] that treats context as first-class citizen and can express context-driven 
queries, in which context is important for selecting the right data. MQL is based on 
key concepts of Lorel [5], and its data model is Multidimensional OEM (MOEM) [2], 
an extension of OEM [3] suitable for MSSD. We present an evaluation process for 
MQL queries that we have implemented in a prototype system. As part of this evalua- 
tion process, MQL queries are translated to “equivalent” Lorel queries, and MOEM 
databases are transformed to corresponding OEM databases. Our purpose is to intui- 
tively compare the two query languages and data models; although OEM and Lorel 
are not aware of the notion of context, they are in principle capable of handling con- 
text-dependent information encoded in a graph. Through this comparison, we demon- 
strate the benefits of directly supporting context as first-class citizen; MQL and 
MOEM are much more elegant and expressive when context is involved, while they 
become as simple as Lorel and OEM when context is not an issue. The query transla- 
tion process is illustrated using a cross-world MQL query. Cross-world queries relate 
facets of information that hold under different worlds, and have no counterpart in 
context-unaware query languages and data models. 

The paper is structured as follows. Section 2 reviews preliminary material on 
MSSD. Section 3 introduces MQL. Section 4 presents in detail the MQL evaluation 
process; the transformation of MOEM to OEM is specified first, and then the transla- 
tion of MQL to Lorel is explained. Finally, Section 5 summarizes the conclusions. 



2 Multidimensional Semistructured Data 

The main difference between conventional and multidimensional semistructured data 
is the introduction of context specifiers. Context specifiers are syntactic constructs 
that are used to qualify pieces of semistructured data and specify sets of worlds under 
which those pieces hold. In this way, it is possible to have variants of the same infor- 
mation entity, each holding under a different set of worlds. An information entity that 
encompasses a number of variants is called multidimensional entity, and its variants 
are called facets of the entity. Each facet is associated with a context that defines the
conditions under which the facet becomes a holding facet of the multidimensional 
entity. 



2.1 Dimensions and Worlds 

The notion of world is fundamental in MSSD. A world is specified using parameters 
called dimensions, and represents an environment under which data obtain a sub- 
stance. The notion of world is defined [2] with respect to a set of dimensions D and 
requires that every dimension in D be assigned a single value. 

In MSSD, sets of worlds are represented by context specifiers (or simply contexts), 
which are constraints on dimension values. The use of dimensions for representing 
worlds is shown through the following context specifiers: 

(a) [time=07:45]

(b) [language=greek, detail in {low, medium}]

(c) [season in {fall, spring}, daytime=noon | season=summer]

Context specifier (a) represents the worlds for which the dimension time has the
value 07:45, while (b) represents the worlds for which language is greek and
detail is either low or medium. Context specifier (c) is more complex, and represents
the worlds where season is either fall or spring and daytime is noon,
together with the worlds where season is summer. For a set of (dimension, value)
pairs to represent a world with respect to a set of dimensions D, it must contain
exactly one pair for each dimension in D. Therefore, if D = {language, detail}
with domains V_language = {english, greek} and V_detail = {low, medium, high},
then {(language, greek), (detail, low)} is one of the six possible worlds with
respect to D. This world is represented by context specifier (b), together with
the world {(language, greek), (detail, medium)}. It is not
necessary for a context specifier to contain values for every dimension in D. Omitting 
a dimension implies that its value may range over the whole domain. 
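
The expansion of a context specifier into worlds can be illustrated with a minimal Python sketch. The representation is an assumption made for illustration only (it is not the encoding used in [2,4]): a specifier is a list of clauses, one per alternative separated by `|`, and each clause maps a dimension to the set of values it allows.

```python
from itertools import product

# Domains of the dimensions in D, taken from the example in the text.
DOMAINS = {
    "language": {"english", "greek"},
    "detail": {"low", "medium", "high"},
}

def worlds(specifier, domains):
    """Expand a context specifier into the set of worlds it represents.

    `specifier` is a list of clauses (hypothetical encoding); a dimension
    omitted from a clause ranges over its whole domain.
    """
    dims = sorted(domains)
    result = set()
    for clause in specifier:
        # Allowed values per dimension, restricted to the declared domain.
        choices = [sorted(clause.get(d, domains[d]) & domains[d]) for d in dims]
        for values in product(*choices):
            # A world holds exactly one (dimension, value) pair per dimension.
            result.add(frozenset(zip(dims, values)))
    return result

# Context specifier (b): [language=greek, detail in {low, medium}]
spec_b = [{"language": {"greek"}, "detail": {"low", "medium"}}]
worlds_b = worlds(spec_b, DOMAINS)
all_worlds = worlds([{}], DOMAINS)   # one unconstrained clause: every world
no_worlds = worlds([], DOMAINS)      # no clause at all: no world
```

With these domains, `all_worlds` contains the six possible worlds and `worlds_b` contains the two worlds named in the text.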

The context specifier [ ] is a universal context and represents the set of all possible 
worlds with respect to any set of dimensions D, while the context specifier [ - ] is an 
empty context and represents the empty set of worlds with respect to any set of dimen- 
sions D. In [2,4] we have defined operations on context specifiers, such as context 
intersection and context union that correspond to the conventional set operations of 
intersection and union on the related sets of worlds. We have also defined how a con- 
text specifier can be transformed to the set of worlds it represents with respect to a set 
of dimensions D. Moreover, context equality and context subset allow contexts to be
compared based on their respective sets of worlds.



2.2 Multidimensional OEM

The Multidimensional Data Graph [4] is an extension of the Object Exchange Model (OEM)
[3,5], suitable for representing multidimensional semistructured data. The
Multidimensional Data Graph extends OEM with two new basic elements:

• Multidimensional nodes represent multidimensional entities, and are used to group 
together nodes that constitute facets of the entities. Graphically, multidimensional 
nodes have a rectangular shape to distinguish them from conventional circular 
nodes. 

• Context edges are directed labeled edges that connect multidimensional nodes to
their facets. The label of a context edge pointing to a facet is a context specifier
defining the set of worlds under which that facet holds. Context edges are drawn as
thick lines, to distinguish them from conventional (thin-lined) OEM edges.

In a Multidimensional Data Graph the conventional circular nodes of OEM are
called context nodes and represent facets associated with some context. Conventional
OEM edges (thin-lined) are called entity edges and define relationships between
objects. All nodes are considered objects, and have unique object identifiers (oids).
Context objects are divided into complex objects and atomic objects. Atomic objects have
a value from one of the basic types, e.g. integer, real, string, etc. A context edge
cannot start from a context node, and an entity edge cannot start from a
multidimensional node. Those two are the only constraints on the morphology of a
Multidimensional Data Graph.

As an example, consider the simple Multidimensional Data Graph in Figure 1,
which represents context-dependent information about a music club. For simplicity,
the graph is not fully developed and some of the atomic objects do not have values
attached. The music_club with oid &1 operates at a different address during the
summer than the rest of the year (in Athens it is usual for clubs to move south close to
the sea in the summer period, and north towards the city center during the rest of the
year). Apart from having a different value, context objects can have a different
structure, as is the case of &10 and &15 which are facets of the multidimensional object
address with oid &4. The menu of the club is available in three languages, namely
English, French and Greek. In addition, the club has a couple of alternative parking
places, depending on the time of day as expressed by the dimension daytime. The
music-club review has two facets: node &16 is the low detail facet containing only
the review score with value 6, while the high detail facet &17 contains in addition
review comments in two languages. In what follows we formally define the
Multidimensional Data Graph.

Let CS be the set of all context specifiers, L be the set of all labels, and A be the set
of all atomic values. A Multidimensional Data Graph G is a finite directed edge-labeled
multigraph $G = (V, E, r, v)$, where:

1. The set of nodes V consists of multidimensional nodes and context nodes,
$V = V_{mld} \cup V_{cxt}$. Context nodes are divided into complex nodes and atomic
nodes, $V_{cxt} = V_c \cup V_a$.

2. The set of edges E consists of context edges and entity edges,
$E = E_{cxt} \cup E_{ett}$, such that $E_{cxt} \subseteq (V_{mld} \times CS \times V)$
and $E_{ett} \subseteq (V_c \times L \times V)$.

3. $r \in V$ is the root, with the property that there exists a path from r to every other
node in V.

4. v is a function that assigns values to nodes, such that: $v(x) = M$ if $x \in V_{mld}$,
$v(x) = C$ if $x \in V_c$, and $v(x) = v'(x)$ if $x \in V_a$, where M and C are reserved
values, and $v'$ is a value function $v' : V_a \to A$ which assigns values to atomic nodes.

Two fundamental concepts related to Multidimensional Data Graphs are explicit
context and inherited context [2,4]. The explicit context of a context edge is the
context specifier assigned to that edge, while the explicit context of an entity edge is
considered to be the universal context specifier [ ]. The explicit context can be
considered as the “true” context only within the boundaries of a single multidimensional
entity. When entities are connected together in a graph, the explicit context of an edge
is not the “true” context, in the sense that it does not alone determine the worlds under
which the destination node holds. The reason for this is that, when an entity e2 is part
of (pointed to through an edge by) another entity e1, then e2 can have substance only
under the worlds that e1 has substance. This can be conceived as if the context under
which e1 holds is inherited to e2. The context propagated in that way is combined with
(constrained by) the explicit context of each edge to give the inherited context for that
edge. The inherited context of a node is the union of the inherited contexts of its
incoming edges (the inherited context of the root is [ ]). As an example, node &18 in
Figure 1 has inherited context [detail in {low, high}]. Worlds where detail is
low are inherited through node &16, while worlds where detail is high are
inherited through node &17.
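
The top-down propagation described above can be sketched in Python as a small fixpoint computation over sets of worlds. This is an illustrative sketch, not the algorithm of [2,4]: the edge-list encoding, the function name, and the three-valued domain of detail are assumptions, and the graph is only a simplified fragment around the review entity.

```python
def inherited_contexts(root, edges, all_worlds):
    """Propagate contexts from the root downwards.

    edges: list of (source, target, explicit) where `explicit` is the set of
    worlds of the edge's explicit context (entity edges carry all worlds).
    Returns (node contexts, per-edge contexts).
    """
    node_ctx = {root: set(all_worlds)}          # the root holds everywhere
    edge_ctx = [set() for _ in edges]
    changed = True
    while changed:                              # iterate until nothing grows
        changed = False
        for i, (src, dst, explicit) in enumerate(edges):
            # Inherited context of an edge: what the source inherits,
            # constrained by the edge's explicit context.
            ctx = node_ctx.get(src, set()) & explicit
            if ctx != edge_ctx[i]:
                edge_ctx[i] = ctx
                changed = True
            # Inherited context of a node: union over incoming edges.
            if not ctx <= node_ctx.get(dst, set()):
                node_ctx[dst] = node_ctx.get(dst, set()) | ctx
                changed = True
    return node_ctx, edge_ctx

# Hypothetical fragment: worlds are identified by the value of detail only.
ALL = {"low", "medium", "high"}
edges = [
    ("&1", "&5", ALL),         # entity edge to the review entity
    ("&5", "&16", {"low"}),    # context edge [detail=low]
    ("&5", "&17", {"high"}),   # context edge [detail=high]
    ("&16", "&18", ALL),       # entity edge to the score
    ("&17", "&18", ALL),       # entity edge to the same score
]
node_ctx, edge_ctx = inherited_contexts("&1", edges, ALL)
```

In this fragment `node_ctx["&18"]` is the union of what arrives through &16 and &17, mirroring the [detail in {low, high}] example in the text.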

In a Multidimensional Data Graph leaves are not restricted to atomic nodes, and can
be complex or multidimensional nodes as well. This raises the question of under which
worlds a path leads to a leaf that is an atomic node. Those worlds are given by
context coverage, which is symmetric to inherited context, but propagates in the
opposite direction: from the leaves up to the root of the graph. The context coverage of a
node or an edge represents the worlds under which the node or edge has access to
leaves that are atomic nodes. The context coverage of leaves that are atomic nodes is
[ ], while the context coverage of leaves that are complex nodes or multidimensional
nodes is [-]. The context coverage of node &19 in Figure 1 is [lang in {gr, en}]
(all leaves in Figure 1 are considered atomic nodes).

For every node or edge, the context intersection of its inherited context and its con- 
text coverage gives the inherited coverage of that node or edge. The inherited cover- 
age represents the worlds under which a node or edge may actually hold, as deter- 
mined by constraints accumulated from both above and below. A related concept is 
path inherited coverage, which is given by the context intersection of the inherited 
coverages of all edges in a path, and represents the worlds under which a complete 
path holds. 
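
As a rough Python sketch of these definitions, context coverage can be computed bottom-up and then intersected with inherited contexts. The propagation rule used here (an edge's coverage is its explicit context intersected with the coverage of its target, and a node's coverage is the union over its outgoing edges) is an assumption read off the "symmetric to inherited context" remark; the graph fragment, the domains, and the hand-written inherited contexts are hypothetical.

```python
def context_coverage(edges, atomic_leaves, all_worlds):
    """Bottom-up counterpart of inherited context (illustrative sketch).

    edges: list of (source, target, explicit_worlds).
    Leaves that are not atomic never receive any coverage.
    """
    node_cov = {n: set(all_worlds) for n in atomic_leaves}
    edge_cov = [set() for _ in edges]
    changed = True
    while changed:
        changed = False
        for i, (src, dst, explicit) in enumerate(edges):
            cov = node_cov.get(dst, set()) & explicit
            if cov != edge_cov[i]:
                edge_cov[i] = cov
                changed = True
            if not cov <= node_cov.get(src, set()):
                node_cov[src] = node_cov.get(src, set()) | cov
                changed = True
    return node_cov, edge_cov

def path_inherited_coverage(path, inherited, coverage, all_worlds):
    """Intersect the inherited coverages of all edges (by index) in a path."""
    result = set(all_worlds)
    for i in path:
        result &= inherited[i] & coverage[i]   # inherited coverage of edge i
    return result

# Hypothetical worlds: (detail, lang) pairs.
ALL = {(d, l) for d in ("low", "high") for l in ("gr", "en", "fr")}
HIGH = {w for w in ALL if w[0] == "high"}
GR = {w for w in ALL if w[1] == "gr"}
EN = {w for w in ALL if w[1] == "en"}

edges = [
    ("&5", "&17", HIGH),   # context edge [detail=high]
    ("&17", "&19", ALL),   # entity edge "comments"
    ("&19", "&20", GR),    # context edge [lang=gr]
    ("&19", "&21", EN),    # context edge [lang=en]
]
node_cov, edge_cov = context_coverage(edges, {"&20", "&21"}, ALL)
# Inherited contexts of the four edges, written out by hand for this fragment.
inherited = [HIGH, HIGH, HIGH & GR, HIGH & EN]
greek_path = path_inherited_coverage([0, 1, 2], inherited, edge_cov, ALL)
```

Here `node_cov["&19"]` covers exactly the worlds where lang is gr or en, and `greek_path` is the single world in which detail is high and lang is gr.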

A context-deterministic Multidimensional Data Graph is a Multidimensional Data 
Graph in which context nodes are accessible from a multidimensional node under 
mutually exclusive inherited coverages (hold under disjoint sets of worlds). Intui- 
tively, context-determinism means that, under any specific world, at most one context 
node is accessible from a multidimensional node. A Multidimensional OEM, or 
MOEM for short, is a context-deterministic Multidimensional Data Graph whose 
every node and edge has a non-empty inherited coverage. In an MOEM all nodes and 
edges hold under at least one world, and all leaves are atomic nodes. The Multidimen- 
sional Data Graph in Figure 1 is an MOEM. 

Given a world w, it is possible to reduce an MOEM to a conventional OEM graph 
holding under w, by eliminating nodes and edges whose inherited coverage does not 
contain w. A process that performs such a reduction to OEM is presented in [2]. In 
addition, given a set of worlds, it is possible to partially reduce an MOEM into a new 
MOEM, that encompasses exactly the OEM facets for the given set of worlds. 
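
The reduction to a conventional OEM under a single world w can be sketched as follows. The actual process is the one presented in [2]; this Python sketch only illustrates the idea under two assumptions that are ours: inherited coverages are already attached to the edges as sets of worlds, and each multidimensional node is collapsed into its single holding facet (which context-determinism guarantees is unique). Node identifiers are hypothetical and the sketch assumes no cycle made only of context edges.

```python
def reduce_to_oem(root, entity_edges, context_edges, w):
    """Keep only what holds under world w and collapse multidimensional nodes.

    entity_edges:  list of (source, label, target, inherited_coverage)
    context_edges: list of (mld_node, facet, inherited_coverage)
    Returns (root of the OEM, list of (source, label, target) edges).
    """
    # Under w, at most one facet of each multidimensional node holds.
    facet_of = {m: f for (m, f, ic) in context_edges if w in ic}

    def resolve(node):
        # Follow holding context edges until a context node is reached.
        while node in facet_of:
            node = facet_of[node]
        return node

    oem_edges = [
        (resolve(src), label, resolve(dst))
        for (src, label, dst, ic) in entity_edges
        if w in ic                      # drop edges that do not hold under w
    ]
    return resolve(root), oem_edges

# Hypothetical address fragment with worlds named after the season.
entity_edges = [
    ("&1", "address", "&4", {"summer", "winter"}),
    ("&10", "street", "&13", {"winter"}),
    ("&15", "street", "&14", {"summer"}),
]
context_edges = [
    ("&4", "&10", {"winter"}),
    ("&4", "&15", {"summer"}),
]
winter_root, winter_oem = reduce_to_oem("&1", entity_edges, context_edges, "winter")
```

For the world "winter" the address edge is redirected to the winter facet and the summer street edge disappears.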



3 Multidimensional Query Language 



Multidimensional Query Language (MQL) [4] is a query language designed especially
for MOEM databases, and is essentially an extension of Lorel [5]. An important
feature of MQL is context path expressions, which are path expressions qualified with
context specifiers and context variables. Context path expressions take advantage of
the fact that every Multidimensional Data Graph can be transformed to a canonical
form [4,7], where every context node is a child of multidimensional nodes only, and
vice-versa. The canonical form of the MOEM in Figure 1 contains an additional
multidimensional node which is pointed to by the entity edge labeled name, and whose only
facet under every possible world (explicit context [ ]) is the context node &3. It also
contains similar multidimensional nodes for the context nodes &11, &12, &13, &14,
&18, and &1 (the root of a graph in canonical form is always a multidimensional
node). If a graph is in canonical form, every possible path is formed by a repeated
succession of one context edge and one entity edge. Context path expressions are built
around the canonical form, and therefore consist of a number of entity parts and facet
parts succeeding one another. Entity parts follow a dot (.) and are matched against
entity edges, while facet parts follow a double colon (::) and are matched against
context edges:

[detail=high] music_club::[-].review::[-] X

In this context path expression, music_club and review are entity parts, while
the two empty context specifiers [-] are facet parts. A facet part matches a
corresponding context edge if it is a subset of the explicit context of the edge, in other
words, if every world it defines is covered by the explicit context of the edge.
Consequently, the empty context [-] as a facet part matches any context edge. The context
specifier [detail=high] is an inherited coverage qualifier and is matched against
the path inherited coverage of a path. For a path to match an inherited coverage
qualifier, it must hold under every world specified by the qualifier. An inherited coverage
qualifier may precede any entity part or facet part in a context path expression. Facet
parts can often be omitted, implying the empty context [-]. Therefore, the above
context path expression can also be written as:

[detail=high] music_club.review X

Evaluated on the graph of Figure 1, this context path expression causes the context
object variable X to bind to node &17. Had we used a multidimensional object
variable, denoted <X>, we would have caused it to bind to the multidimensional node &5.
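
Both matching rules, for facet parts and for inherited coverage qualifiers, reduce to a subset test on sets of worlds. The Python sketch below illustrates this under simplifying assumptions that are ours: detail is treated as the only dimension, and the path inherited coverages of the two review facets are written out by hand.

```python
def matches(required_worlds, available_worlds):
    """True if every required world is among the available ones.

    Used both for a facet part against the explicit context of a context
    edge and for a qualifier against the path inherited coverage of a path.
    """
    return set(required_worlds) <= set(available_worlds)

EMPTY = set()        # the empty context [-]
HIGH = {"high"}      # [detail=high], with detail as the only dimension

# Hypothetical path inherited coverages of the paths ending at each facet.
path_coverage = {"&16": {"low"}, "&17": {"high"}}

# Nodes the context object variable X may bind to for
# [detail=high] music_club.review X
bound = [node for node, cov in path_coverage.items() if matches(HIGH, cov)]
```

Because the empty set is a subset of every set, `matches(EMPTY, ...)` always succeeds, which is why [-] as a facet part matches any context edge.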

Consider the following MQL query: 

select name: P, winter_street: Y
from music_club X,
     X.[season=winter]address.street Y,
     X.[season=summer]address.street Z,
     X.name P
where Z="Omirou"

This is a cross-world query, which returns the name and the street address in
winter of a music club whose summer address is known. Evaluated on the database of
Figure 1, variable P binds to node &3 and Y binds to &13. The result of an MQL
query is always a Multidimensional Data Graph in the form of an mssd-expression [2].



4 Evaluating MQL Queries with LORE 

We have implemented MQL on top of LORE [10], analogously to Lorel, which has 
been implemented on top of an object database [5]. We have chosen LORE as a basis 
for implementing MQL, because our purpose was to see: (a) how an MQL query 
compares with an “equivalent” Lorel query, and (b) how an MOEM can be expressed 
through a conventional OEM. 

The overall architecture is shown in Figure 2. The process we want to implement is 
depicted as a dashed line, which starts from an MQL query, passes through an 
MOEM database, and concludes with a Multidimensional Data Graph that is the result 
of the query.

Fig. 2. Evaluating MQL queries using LORE



The process that actually takes place is depicted as a normal line, and shows a Lo- 
rel query evaluated on an OEM database that returns an OEM graph as a result. This 
line together with the ellipse-shaped boxes is part of LORE, which is controlled by 
our system through the programming interface that it provides. 

The main issue is to define a transformation T from Multidimensional Data Graphs
M to OEMs $O = T(M)$, with the following properties:

• The reverse transformation $T^{-1}$ exists, and if O is given then $M = T^{-1}(O)$ can be
recovered.

• It is possible to translate an MQL query $q_M$ to an “equivalent” Lorel query $q_O$.

By equivalent we mean that if $q_M$ evaluated on M returns M' and $q_O$ evaluated on O
returns O', then $T(M') = O'$. Then, the answer to $q_M$ can be computed by evaluating $q_O$
on $T(M)$, and by applying the reverse transformation $T^{-1}(O')$ to the results of $q_O$.

Those transformations and the MQL query translation are depicted in Figure 2 as
thick horizontal arrows. The system, among other things, implements those arrows
and performs the following key steps:

1. Converts an MOEM database to an OEM database, which becomes the database of
LORE.

2. Translates an MQL query to a Lorel query, which is passed over to LORE for
evaluation on the OEM database.

3. Gets the results from LORE, and converts them from OEM back to a
Multidimensional Data Graph.

Step 1 initializes the database and corresponds to the gray horizontal arrow in the
middle of Figure 2, while steps 2 and 3 are carried out every time an MQL query is
submitted. The Multidimensional Data Graph of step 3 is the result of the MQL query
of step 2 evaluated on the MOEM database of step 1.

In the following sections, we specify the three key steps listed above. The actual 
application that implements them is part of a more comprehensive platform [11] for 
MSSD, and is presented in Section 4.4.

Fig. 3. Representing Multidimensional Data Graph using OEM



4.1 Transforming MOEM Databases to OEM 

In order to transform MOEM to OEM, we must use special OEM structures to repre- 
sent MOEM elements that do not have a counterpart in OEM, namely context edges 
and multidimensional nodes: context edges can be represented by OEM edges that 
have some special label, while multidimensional nodes correspond to OEM nodes 
from which these special edges depart. Moreover, we must encode context in OEM in 
a way Lorel can understand and handle. Contexts that must be encoded include ex- 
plicit contexts and inherited coverages of edges. 

Figure 3 gives an intuition of the transformation. It presents a simple Multidimen- 
sional Data Graph M together with its OEM counterpart O. Nodes with a capital letter 
in M correspond to nodes with the respective lowercase letter in O. Observe that for 
each edge in M an additional node exists in O, splitting the edge in two OEM edges. 
The role of this node is to group the encoded context(s) for the corresponding Multi- 
dimensional Data Graph edge. A number of reserved labels with special meaning are 
used. All reserved labels start with an underscore, and they are: _ett, _facet, 
_cxt, _icw, and _ecw. An entity edge is represented by an edge with the same 
label and a following edge labeled _ett. A context edge is represented by an edge 
labeled _facet and a following edge labeled _cxt. Explicit contexts and inherited
coverages of edges are converted to the worlds they represent, and all possible worlds 
are mapped to integers. Edges that are labeled _icw point to the enumerated worlds 
(integer-valued nodes) that belong to inherited coverages, while edges labeled _ecw 
point to the enumerated worlds that belong to explicit contexts. 

The actual transformation process of a Multidimensional Data Graph
$M = (V_{mld}, V_{cxt}, E_{cxt}, E_{ett}, r, v)$ to an OEM O is given below.

O ← MDGToOEM(M) is:

1. For every world, add a new atomic node w to $V_{cxt}$ that corresponds to that world,
having as value the integer mapped to the world.

2. Move all (multidimensional) nodes from $V_{mld}$ to $V_{cxt}$ (complex nodes).

3. For every edge $h = (q, l, p) \in E_{ett}$ add a new complex node u to $V_{cxt}$. Then,
replace h with the new edges (q, l, u) and (u, _ett, p) in $E_{ett}$. Next, for every world
represented by the inherited coverage of h, add an edge (u, _icw, w) to $E_{ett}$, where w
corresponds to that world.

4. For every edge $h = (q, c, p) \in E_{cxt}$, add a new complex node u to $V_{cxt}$. Then,
remove h from $E_{cxt}$, and add the edges (q, _facet, u) and (u, _cxt, p) to $E_{ett}$. Next,
for every world represented by the inherited coverage of h, add an edge (u, _icw, w)
to $E_{ett}$, where w corresponds to that world. Moreover, for every world represented
by the explicit context of h, add an edge (u, _ecw, w) to $E_{ett}$, where w corresponds
to that world.

5. Return $O = (V_{cxt}, E_{ett}, r, v)$.
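
The five steps above can be sketched in Python as follows. This is an illustrative sketch rather than the prototype's code: the tuple layout of the edges, the generated node names `u1, u2, ...` and `w1, w2, ...` (assumed not to clash with existing oids), and the tiny example graph are all assumptions.

```python
from itertools import count

def mdg_to_oem(root, mld_nodes, cxt_nodes, cxt_edges, ett_edges, values, world_ids):
    """Sketch of MDGToOEM.

    cxt_edges: list of (q, explicit_worlds, p, inherited_coverage)
    ett_edges: list of (q, label, p, inherited_coverage)
    world_ids: dict mapping each world to the integer assigned to it
    Returns (nodes, edges, root, values) of the resulting OEM.
    """
    fresh = count(1)
    out_values = dict(values)
    edges = []
    # Step 2: multidimensional nodes become ordinary complex OEM nodes.
    nodes = set(mld_nodes) | set(cxt_nodes)
    # Step 1: one atomic node per world, valued with the world's integer.
    world_node = {}
    for world, number in world_ids.items():
        w = f"w{number}"
        world_node[world] = w
        nodes.add(w)
        out_values[w] = number

    def new_hub():
        u = f"u{next(fresh)}"
        nodes.add(u)
        return u

    # Step 3: split every entity edge and attach its inherited coverage.
    for q, label, p, icw in ett_edges:
        u = new_hub()
        edges += [(q, label, u), (u, "_ett", p)]
        edges += [(u, "_icw", world_node[x]) for x in sorted(icw)]
    # Step 4: split every context edge; attach coverage and explicit context.
    for q, ecw, p, icw in cxt_edges:
        u = new_hub()
        edges += [(q, "_facet", u), (u, "_cxt", p)]
        edges += [(u, "_icw", world_node[x]) for x in sorted(icw)]
        edges += [(u, "_ecw", world_node[x]) for x in sorted(ecw)]
    # Step 5: the OEM keeps the same root.
    return nodes, edges, root, out_values

# Hypothetical graph: multidimensional node A with one facet B, which has a street C.
oem_nodes, oem_edges, oem_root, oem_values = mdg_to_oem(
    "A", {"A"}, {"B", "C"},
    [("A", {"winter"}, "B", {"winter"})],
    [("B", "street", "C", {"winter"})],
    {"C": "a street name"},
    {"winter": 1, "summer": 2},
)
```

Each original edge ends up as a pair of OEM edges around a fresh hub node, and the hub carries the `_icw` (and, for context edges, `_ecw`) edges to the world nodes.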

4.2 Translating MQL Queries to Lorel 

MQL queries can be translated to “equivalent” Lorel queries, which are evaluated on 
the OEM given by the transformation defined in the previous section. For this transla- 
tion to work, the MOEM database must be in canonical form when the transformation 
to OEM takes place. This allows context path expressions, which are built around the 
canonical form, to be translated to “equivalent” Lorel path expressions. 

In this section we specify such a translation that supports the major features of
MQL. We have not addressed some features of MQL that seemed difficult or
impossible to translate, like regular expressions in context path expressions (general
context path expressions [4]).

Converting Context Path Expressions to Path Expressions. To facilitate the
comprehension of translated Lorel queries, we use E-HUB as identifier for nodes from
which a _ett edge departs, MLD for nodes from which a _facet edge departs, and
C-HUB for nodes from which a _cxt edge departs. In addition, we use w1, w2, ... to
denote the integers that have been mapped to worlds.

We start with a very simple MQL from clause:

from X.[c]label Y

We assume [c] is a context specifier representing the worlds that correspond to
w4, w7, and w9. As shown in [4], [c] is implied throughout the path it qualifies, and
the clause can be written as:

from X.[c]label::[c][-] Y

This from clause can also be written in MQL using a multidimensional object
variable <V> as:

from X.[c]label <V>,
     <V>::[c][-] Y

The equivalent Lorel expression is:

from X.label E-HUB, E-HUB._ett MLD,
     MLD._facet C-HUB, C-HUB._cxt Y
where E-HUB._icw{W4} = w4
  and E-HUB._icw{W7} = w7
  and E-HUB._icw{W9} = w9
  and C-HUB._icw{W4} = w4
  and C-HUB._icw{W7} = w7
  and C-HUB._icw{W9} = w9

The where clause states that the inherited coverages of the two MOEM edges
must contain all the worlds specified by [c]. Observe the use of the variables W4,
W7, and W9, which declare that it is not the same node that must be equal to w4, to w7,
and to w9 (otherwise the condition would always be false).
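
The translation of one qualified step is mechanical, so it can be sketched as a small Python string generator. The function name and its parameters are hypothetical; the generated text simply follows the pattern of the Lorel expression above and is not checked against a Lorel parser.

```python
def translate_step(src, label, dst, world_numbers, suffix=""):
    """Translate the MQL step `src.[c]label dst` into Lorel from/where items.

    world_numbers: the integers mapped to the worlds of the qualifier [c].
    suffix: optional number appended to the hub variable names.
    """
    e_hub, mld, c_hub = f"E-HUB{suffix}", f"MLD{suffix}", f"C-HUB{suffix}"
    from_items = [
        f"{src}.{label} {e_hub}",    # entity edge, first half
        f"{e_hub}._ett {mld}",       # entity edge, second half
        f"{mld}._facet {c_hub}",     # context edge, first half
        f"{c_hub}._cxt {dst}",       # context edge, second half
    ]
    # Both MOEM edges must hold under every world of the qualifier.
    where_items = [
        f"{hub}._icw{{W{n}}} = w{n}"
        for hub in (e_hub, c_hub)
        for n in world_numbers
    ]
    return from_items, where_items

from_items, where_items = translate_step("X", "label", "Y", [4, 7, 9])
lorel = "from " + ", ".join(from_items) + "\nwhere " + "\n  and ".join(where_items)
```

Printing `lorel` reproduces the from clause and the six where conditions shown above, up to line breaks.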

We now use the MQL query example of Section 3, and apply the same process to
its from clause. For brevity, we use [c1] to denote the context specifier
[season=winter], and [c2] to denote [season=summer]. The MQL query can
now be written as:



select name: P, winter_street: Y
from [-]music_club <V1>, <V1>::[-][-] X,
     X.[c1]address <V2>, <V2>::[c1][-] V3,
     V3.[c1]street <V4>, <V4>::[c1][-] Y,
     X.[c2]address <V5>, <V5>::[c2][-] V6,
     V6.[c2]street <V7>, <V7>::[c2][-] Z,
     X.[-]name <V8>, <V8>::[-][-] P
where Z="Omirou"



The equivalent Lorel query is: 



select name: P, winter_street: Y
from music_club E-HUB1, E-HUB1._ett V1,
     V1._facet C-HUB1, C-HUB1._cxt X,
     X.address E-HUB2, E-HUB2._ett V2,
     V2._facet C-HUB2, C-HUB2._cxt V3,
     V3.street E-HUB3, E-HUB3._ett V4,
     V4._facet C-HUB3, C-HUB3._cxt Y,
     X.address E-HUB4, E-HUB4._ett V5,
     V5._facet C-HUB4, C-HUB4._cxt V6,
     V6.street E-HUB5, E-HUB5._ett V7,
     V7._facet C-HUB5, C-HUB5._cxt Z,
     X.name E-HUB6, E-HUB6._ett V8,
     V8._facet C-HUB6, C-HUB6._cxt P
where Z="Omirou"
  and predicate(E-HUB2) and predicate(C-HUB2)
  and predicate(E-HUB3) and predicate(C-HUB3)
  and predicate(E-HUB4) and predicate(C-HUB4)
  and predicate(E-HUB5) and predicate(C-HUB5)

The expressions predicate(VAR) ensure that the corresponding edges have a
proper inherited coverage. Therefore, each expression predicate(VAR) must be
replaced by

VAR._icw{W1} = w1 and VAR._icw{W2} = w2 and VAR._icw{W3} = w3 and ...

where w1, w2, w3, ... are the integers that correspond to worlds of the respective
inherited coverage qualifier: for E-HUB2, C-HUB2, E-HUB3, and C-HUB3 the inherited
coverage qualifier is [season=winter], while for E-HUB4, C-HUB4, E-HUB5,
and C-HUB5 the inherited coverage qualifier is [season=summer]. The edges that
correspond to variables E-HUB1, C-HUB1, E-HUB6, and C-HUB6 can have any
inherited coverage because their implied inherited coverage qualifier is the empty
context [-], thus they are not included in where.

Using the above framework, it is straightforward to translate MQL queries that
contain multidimensional object variables. Actually, the analytical form of our MQL
query example contains the multidimensional object variables <V1>, <V2>, <V4>,
<V5>, <V7>, and <V8>, which correspond to the variables V1, V2, V4, V5, V7, and
V8 of the equivalent Lorel query. In addition, it is easy to accommodate explicit
context qualifiers. A facet part ::[ci][ce] will result in a predicate of the form:

VAR._icw{W1} = w1 and VAR._icw{W2} = w2 and ... and VAR._icw{W3} = w3
and VAR._ecw{W2} = w2 and ... and VAR._ecw{W6} = w6

where w1, w2, w3, ... correspond to the worlds of the inherited coverage qualifier
[ci], and w2, w6, ... correspond to the worlds of the explicit context qualifier [ce].



Context Variables and “within” Clause. MQL uses an additional within clause to
express conditions on contexts. Consider the MQL query:

select comments: Y
from music_club.[X]review.comments Y
within [X] * [detail=high] <= [lang=gr]

The context variable [X] binds to the path inherited coverage of the path

review::[-].comments::[-]

and the condition in within requires that the context intersection (denoted *)
between this path inherited coverage and [detail=high] be a context subset (denoted
<=) of [lang=gr]. Consequently, this condition ensures that the query returns
comments facets in Greek in high detail (node &20 in Figure 1). Note that there are
more intuitive ways to express the same query in MQL, but with less demonstrative
value.

The first step is to express context specifiers as Lorel queries. Suppose that
[detail=high] represents the worlds that correspond to the integers w1, w2, and w3.
The following Lorel query evaluates to the respective nodes:

select W
from music_club.#._icw W
where W=w1 or W=w2 or W=w3

Let us use L[detail=high] to refer to this query, and L[lang=gr] to refer to an
analogous Lorel query that expresses [lang=gr]. In addition, we use the symbol
L_CXT_VAR to refer to a Lorel query expressing the path inherited coverage bound to the
context variable [X]. This Lorel query is:

E-HUB2._icw intersect C-HUB2._icw
intersect
E-HUB3._icw intersect C-HUB3._icw

The query evaluates to the “worlds” under which all edges of the path hold. Now
that we have expressed all contexts as queries evaluating to sets of nodes that
represent worlds, we can express context subset as a relation between the queries.
Assuming that query1 expresses a context [c1] and query2 a context [c2], the
condition [c1] <= [c2] ([c1] context subset of [c2]) is implemented by the predicate:

for all LEFT in (query1) :
    exists RIGHT in (query2) : LEFT = RIGHT

where LEFT and RIGHT are Lorel variables that range over the “worlds” to the left
and to the right side of the symbol <=, respectively.
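
The same quantifier pattern can be mirrored in Python, with sets of world integers standing in for the results of the Lorel queries. The concrete integers below are hypothetical and serve only to exercise the predicate.

```python
def context_subset(query1, query2):
    """[c1] <= [c2]: for all LEFT in query1 there exists RIGHT in query2
    such that LEFT = RIGHT."""
    return all(any(left == right for right in query2) for left in query1)

# Hypothetical world integers returned by the three queries.
cxt_var = {1, 2, 5}        # path inherited coverage bound to [X]
detail_high = {1, 2, 3}    # worlds of [detail=high]
lang_gr = {1, 2, 4}        # worlds of [lang=gr]

# within [X] * [detail=high] <= [lang=gr]
holds = context_subset(cxt_var & detail_high, lang_gr)
# A path that also holds in world 3 (high detail, not Greek) fails the test.
fails = context_subset({1, 3} & detail_high, lang_gr)
```

The nested `all`/`any` is equivalent to Python's built-in subset test on sets; it is spelled out here only to match the for all / exists form of the Lorel predicate.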

The MQL query can now be translated to the following Lorel query: 

select comments: Y
from music_club E-HUB1, E-HUB1._ett V1,
     V1._facet C-HUB1, C-HUB1._cxt V2,
     V2.review E-HUB2, E-HUB2._ett V3,
     V3._facet C-HUB2, C-HUB2._cxt V4,
     V4.comments E-HUB3, E-HUB3._ett V5,
     V5._facet C-HUB3, C-HUB3._cxt Y
where for all LEFT in
        (L_CXT_VAR intersect L[detail=high]) :
      exists RIGHT in (L[lang=gr]) :
        LEFT = RIGHT

By combining in similar ways Lorel queries that express sets of worlds, it is 
straightforward to implement any context condition in the within clause.



4.3 Transforming OEM Results to M.D.G.



LORE returns the result of a Lorel query as an OEM graph. As stated, this OEM
graph can be transformed to a Multidimensional Data Graph, which is the result of the
original MQL query.

The process that transforms an OEM $O = (V, E, r, v)$ to a Multidimensional Data
Graph M is given below.

M ← OEMToMDG(O) is:

1. Represent O as a Multidimensional Data Graph
$M = (V_{mld}, V_{cxt}, E_{cxt}, E_{ett}, r, v)$, where $V_{cxt} = V$, $E_{ett} = E$, and
$V_{mld}$, $E_{cxt}$ are empty sets.

2. For every edge $h = (q, l, u) \in E_{ett}$ where l is not a reserved label, remove u from
$V_{cxt}$. Then remove h and (u, _ett, p) from $E_{ett}$, and add the edge (q, l, p) to
$E_{ett}$. Remove all edges (u, _icw, w) from $E_{ett}$.

3. For every edge $h = (q, \_facet, u) \in E_{ett}$, move q from $V_{cxt}$ to $V_{mld}$ (if not
already moved), and remove u from $V_{cxt}$. For all nodes w, where
$(u, \_ecw, w) \in E_{ett}$, apply context union to the corresponding worlds to get a
context specifier c. Then remove h and (u, _cxt, p) from $E_{ett}$, and add the edge
(q, c, p) to $E_{cxt}$. Remove all edges (u, _ecw, w) and (u, _icw, w) from $E_{ett}$.

4. Remove from $V_{cxt}$ all nodes that correspond to worlds (unreachable from the root at
this time).

5. Return $M = (V_{mld}, V_{cxt}, E_{cxt}, E_{ett}, r, v)$.
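
A compact Python sketch of the reverse direction is given below. It is an illustration under stated assumptions, not the prototype's implementation: it builds the resulting edge sets instead of removing edges in place, it returns the explicit context of each context edge as a set of worlds rather than a reconstructed context specifier, and `world_of` (the mapping from world nodes back to worlds) is a hypothetical parameter.

```python
RESERVED = {"_ett", "_facet", "_cxt", "_icw", "_ecw"}

def oem_to_mdg(edges, world_of):
    """Sketch of OEMToMDG.

    edges: list of (source, label, target) OEM edges
    world_of: dict mapping world nodes to the worlds they stand for
    Returns (mld_nodes, cxt_edges, ett_edges).
    """
    outgoing = {}
    for src, label, dst in edges:
        outgoing.setdefault(src, []).append((label, dst))

    def target(hub, label):
        # The node reached from a hub through the given reserved label.
        return next(d for l, d in outgoing.get(hub, []) if l == label)

    mld_nodes, cxt_edges, ett_edges = set(), [], []
    for q, label, u in edges:
        if label not in RESERVED:
            # Step 2: rejoin the two halves of an entity edge.
            ett_edges.append((q, label, target(u, "_ett")))
        elif label == "_facet":
            # Step 3: q is a multidimensional node; the worlds reached through
            # _ecw edges form the explicit context of the context edge.
            mld_nodes.add(q)
            explicit = frozenset(
                world_of[w] for l, w in outgoing.get(u, []) if l == "_ecw"
            )
            cxt_edges.append((q, explicit, target(u, "_cxt")))
    # Hub nodes and world nodes are simply not carried over (steps 2 to 4).
    return mld_nodes, cxt_edges, ett_edges

# Hypothetical OEM: entity edge B -street-> C and context edge A -[winter]-> B.
oem_edges = [
    ("B", "street", "u1"), ("u1", "_ett", "C"), ("u1", "_icw", "w1"),
    ("A", "_facet", "u2"), ("u2", "_cxt", "B"),
    ("u2", "_icw", "w1"), ("u2", "_ecw", "w1"),
]
mld_nodes, cxt_edges, ett_edges = oem_to_mdg(oem_edges, {"w1": "winter", "w2": "summer"})
```

The `world_of` argument has to be derived from the same world-to-integer mapping that was used when the database was first encoded, otherwise the explicit contexts cannot be recovered.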



Notice that, in order to reconstruct context specifiers, step 3 needs the same
mapping of worlds to integers that was initially used while transforming the MOEM
database to OEM.

Fig. 4. Evaluating an MQL query and displaying the results in MSSDesigner



4.4 Prototype Implementation

Our prototype system is implemented in Java and interfaces with LORE, which is
used as a back-end. The system is initialized with an MOEM database, receives MQL
queries, and returns Multidimensional Data Graphs as results. The system actually 
constitutes the Query Subsystem of MSSDesigner [11], a more general platform for 
managing multidimensional semistructured data, and is shown in Figure 4. It relies on 
this platform to perform basic functions, like carrying out context operations and 
calculating the inherited coverage of MOEMs. 



5 Conclusions 

In this paper we demonstrated some of the benefits of treating context as a first-class
citizen in Web data models and query languages. We briefly introduced MQL, a
context-aware query language for semistructured data, and discussed in detail an 
evaluation process for MQL queries. We defined a transformation from MOEM 
graphs to corresponding OEM graphs and vice-versa, and specified how MQL queries 
can be translated to equivalent Lorel queries. We presented a prototype system that 
implements the above, and uses LORE to evaluate MQL queries. This evaluation 
process gave an opportunity for an intuitive comparison between the two query lan- 
guages and data models: MQL and MOEM are much more elegant and expressive 
when context is involved, while they become as simple as Lorel and OEM when con- 
text is not an issue. Moreover, MQL and MOEM directly support cross-world queries, 
which have no counterpart in context-unaware query languages.



Abstract. Structural recursion is a graph traversing and restructuring
operation in UnQL [7], [8], a query language for semistructured data. 
In this paper we consider satisfiability questions mainly in the presence 
of schema graphs [2], [9], which are used for describing the structure 
of semistructured data. We introduce a new kind of simulation between 
schema graphs, with which the relationships can be represented in more 
subtle ways. By means of operational graphs we also develop a new way 
for defining the semantics of structural recursions. Our results give us 
algorithms for checking whether a given query will satisfy the restric- 
tions imposed by schema graphs and techniques with which these can be 
involved in queries. Query optimizing methods are also developed. 



1 Introduction 

Nowadays it is almost a commonplace to talk about the importance of semistructured
[1] and XML [6] data. They are modelled with very similar structures, namely
rooted, labeled, directed graphs, hence several questions of these research fields
can be discussed simultaneously [2]. However, there are also differences. Several
query languages were developed for both structures: for instance Lorel [4], UnQL
[7], [8] for semistructured data and XSL [11], XML-QL [12], XQuery [5] for XML,
just to mention some. Each of them contains operations by means of which data
graphs can be traversed and restructured. In our paper we pay attention to such
an operation, the structural recursion. It was introduced in UnQL [7], [8]. It is
related to top-down tree transducers [16], but it is more powerful than these
because it can express joins and cartesian products [7]. Here we consider only
structural recursions without conditions. Buneman et al. have proven that they
are closed under composition and offered powerful optimizations [7]. Dan Suciu
showed their very advantageous characteristics in distributed systems [17]. In
addition they form the core language of XSL [11], the first commercial XML
query language. There is no doubt that structural recursions are of interest
in their own right and may become a useful tool in other fields of computer science.

The data model of UnQL uses edge labels. An example can be found in
Figure 1. There is a simple syntax for textual representation of trees. Structured
values (internal nodes) can be denoted as $\{l_1 : t_1, \ldots, l_n : t_n\}$, where
$l_i$ stands for labels and $t_i$ for other structured or atomic values. $\{\}$
denotes the empty graph, a graph consisting of a single node only. Note that
$\{l_1 : t_1, \ldots, l_n : t_n\}$ is a set of label/value (label/tree) pairs in
contrast to other models, where to each node a unique object identifier is
assigned. Thus the UnQL model, like the relational model, is "value based" [7].
Since sets of label/tree pairs are considered, trees can be constructed by means
of three basic constructors: the union, $\cup$, the singleton set, $\{l : t\}$,
and the empty graph, $\{\}$ [7].

We present UnQL through an example. We shall not enter into details, the 
interested reader should consult [7], [8]. 

Q := select {result: {D: {owner: {name: C}} ∪
                 {(select {address: E}
                   where {D: {owner: {name: C}, address: E}} in db)}}}
     where {D: {owner: {name: C}}} in db.

The query accomplishes a group by operation, i.e., under an owner edge it inserts 
the owner’s name and the set of the addresses of their restaurants. UnQL has a 
simple select . . .where . . . structure with pattern matching. Patterns are given 
in terms of regular path expressions. Nested queries and other possibilities can 
be used. 

Semistructured data is often described as "schema-less" or "self-describing",
since information, which is part of the schema in traditional databases, is
intermingled with data [2]. However, even partial knowledge of the structure can
help in improving storage, optimizing query evaluation etc. [2], hence various
methods were developed for this purpose. One of them, which suits the data model
of UnQL well, uses schema graphs and dual schema graphs [9]. The optimization of
regular path expressions in the presence of schema graphs was discussed in [14].
Regular path expressions can be encoded as structural recursions [7]. One of our
methods is a generalization of a result of [14].

In the course of static analysis certain properties of queries given by their
syntax are examined without running them. One possible question is whether, for
a given query $q$, there exists an instance $I$ s.t. $q(I)$ is not empty. In
the case of relational algebra this problem is undecidable in general [3].
Hence it is also undecidable in the case of UnQL, because for tree data graphs
encoding relational databases and queries mapping such tree data graphs to tree
data graphs also encoding relational databases, the two languages have the same
expressive power [7].

For answering this satisfiability question in case of structural recursions we 
shall present a new way for defining the semantics for them. Using this it will 
turn out that our problem can be answered easily in linear time. 

Owing to its relation with query optimization mentioned in the first para- 
graph, we shall also examine the satisfiability question in that case, when the 
inputs and outputs of a query q are restricted by means of schema graphs and 
dual schema graphs respectively. We introduce a new kind of simulation between 
schema graphs to be able to represent more subtle relationships. Our algorithms 
answering these problems also give us methods for checking whether the result 
of a given query will satisfy the above restrictions and query rewriting tech- 
niques with which these can be involved in the query. We shall also present two 
optimization possibilities. One of them is an easy generalization of the result of 
[14] as it was already mentioned. The other one is different from it, but in the 
background we always use the same idea given by our semantics. If we set aside 
the time needed to check the satisfaction of a unary formula, then all algorithms 
work in polynomial time.

The usefulness and strength of our semantics become more transparent, if 
structural recursions with conditions introduced in the underlying algebra of 
UnQL, UnCal, are examined. Here there may be unnecessary conditions. These 
should be eliminated. A natural extension of our semantics will turn out to be a 
very good tool for describing the complex relationships between conditions. We 
plan to review our results in another paper. 

In section 2 the basic concepts are described formally. In section 3 the track 
semantics is introduced. The algorithms answering static analytical questions,
checking restrictions and the optimizing methods are in section 4. In section 5 
we present the query rewriting technique through an example. 



2 Basic Concepts 

As mentioned above, semistructured data can be modelled with rooted,
edge-labeled, directed graphs. Formally, let $\mathcal{U}$ be the universe of all
constants ($\mathcal{U} = Int \cup String \cup Bool \cup \ldots$), which also
contains a special constant, $\varepsilon$. For those who are familiar with
automata theory, the role of $\varepsilon$ is similar to that of silent
transitions. Then a data graph $DB$ is a triple, $DB = (V, E, v_0)$, where $V$
is the set of nodes, $E \subseteq V \times \mathcal{U} \times V$ is the set of
edges and $v_0$ is the distinguished root. We shall use the notations $V.DB$,
$E.DB$ for the sets of nodes and edges of a data graph $DB$ respectively.
Usually atomic values are represented by the leaves, however, in case of UnQL
for the sake of convenience it is assumed that the last edges of branches
represent them. See Figure 1 for an example. A path in a graph is a sequence of
subsequent edges of that graph.

Fig. 2. The data graph in Figure 1 conforms dually to the schema graph in (a) and
conforms to the one in (b). The schema graph in (b) subsumes the one in (a).

Two data graphs are considered equivalent if they are bisimilar. Formally, two
data graphs $DB$, $DB'$ are bisimilar if there exists a binary relation
$\approx$ between $V.DB$ and $V.DB'$, s.t. (i) $v_0 \approx v_0'$, where $v_0$,
$v_0'$ are the roots (ii) whenever $u \approx u'$, $u \in V.DB$, $u' \in V.DB'$,
$(u, \varepsilon^*.a, v) \in E.DB$, then $\exists v' \in V.DB'$ s.t.
$(u', \varepsilon^*.a, v') \in E.DB'$, $v \approx v'$, and if
$(u', \varepsilon^*.b, v') \in E.DB'$, then $\exists v \in V.DB$ s.t.
$(u, \varepsilon^*.b, v) \in E.DB$, $v \approx v'$ [7]. Here
$(u, \varepsilon^*.a, v) \in E.DB$ is a sloppy shorthand for the existence of a
path from $u$ to $v$ starting with an arbitrary number of $\varepsilon$ edges
and ending with an $a$ edge. If there exists a bisimulation between $DB$ and
$DB'$, then there always exists a maximal bisimulation, which can be found in
time $O(m \log(m + n))$, where $n = |V.DB| + |V.DB'|$, $m = |E.DB| + |E.DB'|$
[7], [15]. In the sequel we shall always consider this maximal bisimulation. It
is enough to take into account the subgraph reachable from the root, for it is
bisimilar to the whole graph. It is also clear that the concept of equivalence
introduced by bisimulation fits the set semantics of trees well [7].
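
The definition can be illustrated by a naive greatest-fixpoint computation. The Python sketch below is not the $O(m \log(m + n))$ algorithm cited above; it starts from all node pairs and discards pairs that violate the two transfer conditions. It assumes graphs without $\varepsilon$ edges, and the representation of a graph as a `(root, edges)` pair as well as the name `bisimilar` are choices made only for this example.

```python
def bisimilar(g1, g2):
    """g = (root, edges), edges a set of (source, label, target) triples.
    Assumption: the graphs contain no epsilon edges."""
    (root1, edges1), (root2, edges2) = g1, g2
    nodes1 = {root1} | {x for u, _, v in edges1 for x in (u, v)}
    nodes2 = {root2} | {x for u, _, v in edges2 for x in (u, v)}

    def succ(edges, u):
        return [(a, v) for x, a, v in edges if x == u]

    # Start with every pair related and remove violating pairs until stable.
    rel = {(u, v) for u in nodes1 for v in nodes2}
    changed = True
    while changed:
        changed = False
        for u, v in list(rel):
            forth = all(any(a == b and (x, y) in rel for b, y in succ(edges2, v))
                        for a, x in succ(edges1, u))
            back = all(any(a == b and (x, y) in rel for a, x in succ(edges1, u))
                       for b, y in succ(edges2, v))
            if not (forth and back):
                rel.discard((u, v))
                changed = True
    return (root1, root2) in rel
```

For instance, a root with two identical `a` branches is bisimilar to a root with a single `a` branch, which matches the set semantics of trees.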

Now consider a set of base predicates $P_1, P_2, \ldots$ and an interpretation
over $\mathcal{U}$. A unary formula is a formula with at most one free variable.
Typical predicates include $Int(x)$, $String(x)$ etc. and user-defined unary
predicates, $P(x)$. The generated theory $T$ is decidable [9]. A schema graph is
a rooted, labeled, directed graph, whose edges are labeled with unary formulas [9].

A data graph, $DB$, conforms to a schema graph $S$, $DB \preceq S$, if (i)
$v_0 \preceq v_0'$, where $v_0$, $v_0'$ are the roots (ii) whenever
$u \preceq u'$, $u \in V.DB$, $u' \in V.S$, $(u, \varepsilon^*.a, v) \in E.DB$,
then $\exists v' \in V.S$ s.t. $(u', \varepsilon^*.p, v') \in E.S$,
$v \preceq v'$ and $\mathcal{U} \models p(a)$ [9]. In this case one cannot
enforce the presence of a label with some property on a level of a data graph,
only the allowable labels on that level can be specified. An example is shown in
Figure 2.b.

In case of its dual counterpart the possibilities are reversed. One can specify
the required edges on a level, but cannot avoid the appearance of arbitrary
labels there. There is a hint to this notion in [2], however, as far as we know
the correct definition is given in this paper for the first time. A data graph
$DB$ conforms to a schema graph $S$ dually, $S \preceq_d DB$, if (i)
$v_0 \preceq_d v_0'$, where $v_0$, $v_0'$ are the roots (ii) whenever
$u \preceq_d u'$, $u \in V.S$, $u' \in V.DB$, $(u, \varepsilon^*.p, v) \in E.S$,
then $\exists v' \in V.DB$ and $\exists a$ s.t.
$(u', \varepsilon^*.a, v') \in E.DB$, $v \preceq_d v'$ and
$\mathcal{U} \models p(a)$. See an example in Figure 2.a. Note that data graphs
can be considered as special schema graphs, because constants in $\mathcal{U}$
can be substituted with unary predicates becoming true iff the constant in
question appears in its argument. For a given constant $a$, denote the
appropriate predicate $a(x)$ or $p_a$. Hence in the rest of the paper we shall
sometimes blur the distinction between constants and predicates, and we shall
consider data graphs as special cases of schema graphs.
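
Both notions of conformance can be checked by the same kind of fixpoint computation. The sketch below is a simplified illustration under several assumptions: graphs are `(root, edges)` pairs without $\varepsilon$ edges, schema predicates are ordinary Python callables, and the names `conforms` and `conforms_dually` are invented for this example.

```python
def _nodes(graph):
    root, edges = graph
    return {root} | {x for u, _, v in edges for x in (u, v)}

def _greatest(pairs, ok):
    """Largest subset of pairs in which every pair satisfies ok."""
    rel = set(pairs)
    changed = True
    while changed:
        changed = False
        for pair in list(rel):
            if not ok(pair, rel):
                rel.discard(pair)
                changed = True
    return rel

def conforms(db, schema):
    """Every data edge must be allowed by some schema edge."""
    def ok(pair, rel):
        u, s = pair
        return all(
            any(p(a) and (v, t) in rel for s2, p, t in schema[1] if s2 == s)
            for u2, a, v in db[1] if u2 == u
        )
    pairs = [(u, s) for u in _nodes(db) for s in _nodes(schema)]
    return (db[0], schema[0]) in _greatest(pairs, ok)

def conforms_dually(schema, db):
    """Every schema edge must be realised by some data edge."""
    def ok(pair, rel):
        s, u = pair
        return all(
            any(p(a) and (t, v) in rel for u2, a, v in db[1] if u2 == u)
            for s2, p, t in schema[1] if s2 == s
        )
    pairs = [(s, u) for s in _nodes(schema) for u in _nodes(db)]
    return (schema[0], db[0]) in _greatest(pairs, ok)
```

With a schema that requires an `owner` edge followed by a string-labeled edge, a data graph carrying an extra `tel` edge at the root conforms dually (the required edges are present) but does not conform (the `tel` label is not allowed there).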

In the sequel we shall need three basic notions concerning schema graphs. 

(a) Let $S_1$, $S_2$ be schema graphs. $S_2$ subsumes $S_1$, $S_1 \preceq S_2$,
if (i) $v_{01} \preceq v_{02}$, where $v_{01}$, $v_{02}$ are the roots (ii)
whenever $u_1 \preceq u_2$, $u_1 \in V.S_1$, $u_2 \in V.S_2$,
$(u_1, \varepsilon^*.p_1, v_1) \in E.S_1$, then $\forall a \in \mathcal{U}$,
$\mathcal{U} \models p_1(a)$, there exist $v_2 \in V.S_2$ and
$(u_2, \varepsilon^*.p_2, v_2) \in E.S_2$ s.t. $\mathcal{U} \models p_2(a)$,
$v_1 \preceq v_2$ [9]. An example can be seen in Figure 2. The term subsume is
well-justified, because if $[S] = \{DB \mid DB \preceq S\}$, then one can prove
that $S_1 \preceq S_2$ iff $[S_1] \subseteq [S_2]$ [9]. $S_1$, $S_2$ are
considered equivalent, if both $S_1 \preceq S_2$ and $S_2 \preceq S_1$ hold.
Whether $S_1 \preceq S_2$ holds can be checked in time depending on $t$, where
$t$ is the time needed to check the validity of a sentence in the theory $T$ [9].

(b) Sometimes the previous property is too strong for describing the
relationships between schema graphs. Hence we introduce the dual counterpart of
this relation, which, as far as we know, appears in this paper for the first
time. $S_2$ subsumes $S_1$ dually, $S_1 \preceq_d S_2$, if (i)
$v_{01} \preceq_d v_{02}$, where $v_{01}$, $v_{02}$ are the roots (ii) whenever
$u_1 \preceq_d u_2$, $u_1 \in V.S_1$, $u_2 \in V.S_2$,
$(u_1, \varepsilon^*.p_1, v_1) \in E.S_1$, then $\exists a \in \mathcal{U}$,
$\mathcal{U} \models p_1(a)$ and $\exists v_2 \in V.S_2$,
$(u_2, \varepsilon^*.p_2, v_2) \in E.S_2$ s.t. $v_1 \preceq_d v_2$ and
$\mathcal{U} \models p_2(a)$. Note that $S_1 \preceq_d S_2$ informally means
that $S_2$ has a subgraph with the "same" structure as $S_1$.

(c) The intersection of schema graphs will play a decisive role in our
algorithms and proofs. Denote by $S$ the intersection of $S_1$ and $S_2$,
$S := S_1 \sqcap S_2$. Then (i)
$V.S = \{(u_1, u_2) \mid u_i \in V.S_i\ (i = 1, 2)\}$ (ii)
$E.S = \{((u_1, u_2), p, (v_1, v_2)) \mid (u_1, u_2), (v_1, v_2) \in V.S,\ p = p_1 \wedge p_2$,
where $(u_i, p_i, v_i) \in E.S_i\ (i = 1, 2)$ and
$\exists a \in \mathcal{U}, \mathcal{U} \models p_1(a) \wedge p_2(a)\}$ [9]. See
an example in Figure 3. The following can be shown: (i) $S \preceq S_i$
$(i = 1, 2)$ (ii) if $S' \preceq S_i$ $(i = 1, 2)$ also holds, then
$S' \preceq S$ (iii) $[S] = [S_1] \cap [S_2]$ [9].
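
The intersection is a product construction, which the following Python sketch illustrates. To keep the satisfiability test of $p_1 \wedge p_2$ trivial, the sketch assumes that every predicate is given as the finite set of constants satisfying it, so that conjunction becomes set intersection; this encoding and the name `intersect` are assumptions of the example only.

```python
def intersect(s1, s2):
    """s = (root, edges); each edge is (source, predicate, target) and a
    predicate is a frozenset of the constants that satisfy it (assumption)."""
    (root1, edges1), (root2, edges2) = s1, s2
    edges = set()
    for u1, p1, v1 in edges1:
        for u2, p2, v2 in edges2:
            p = p1 & p2          # conjunction of the two predicates
            if p:                # keep the edge only if p is satisfiable
                edges.add(((u1, u2), p, (v1, v2)))
    return (root1, root2), edges
```

Pairs of edges whose predicates have no common constant produce no edge in the product, in line with the satisfiability condition in (ii).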

$\varepsilon$ edges can be eliminated similarly as in the case of automata.
Namely, for a given $\varepsilon$ edge, add edges to its starting node s.t. all
nodes reachable through a $p$ edge from the endpoint are also reachable from the
starting node through a $p$ edge. An example can be found in Figure 3.e. Note
that a schema graph is always equivalent to its counterpart without
$\varepsilon$ edges [7].
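
A minimal sketch of this elimination, in the style of $\varepsilon$-closure computations for automata, is given below. The marker `EPS`, the `(root, edges)` representation and the function name `eliminate_eps` are assumptions of the example; the sketch keeps all nodes and does not prune those that become unreachable.

```python
EPS = "eps"  # stands for the epsilon label (hypothetical marker)

def eliminate_eps(root, edges):
    """edges: set of (source, label, target); returns an epsilon-free graph."""
    nodes = {root} | {x for u, _, v in edges for x in (u, v)}

    def closure(u):
        # All nodes reachable from u through epsilon edges only.
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for s, p, t in edges:
                if s == x and p == EPS and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    result = set()
    for u in nodes:
        for t in closure(u):
            for s, p, w in edges:
                if s == t and p != EPS:
                    result.add((u, p, w))   # u inherits the p edge of t
    return root, result
```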

In the sequel we shall need to consider the union of arbitrary schema graphs
$S_1$, $S_2$. Let $u$ be a new node different from the nodes of $S_1$ and $S_2$.
Add $\varepsilon$ edges from $u$ to the roots. This new graph is defined to be
the union of $S_1$ and $S_2$. Note that in case of tree data graphs the two
semantics of union are equivalent [7]. An example can be found in Figure 3.f.

Fig. 3. (a), (b) schema graphs, (c) the intersection of the schema graphs of (a) and
(b), (d) the maximal subgraph of the schema graph in (c) reachable from the root, (e)
an example for the elimination of $\varepsilon$ edges, (f) the union of schema graphs



Finally we explain by means of an example how structural recursions
work on trees. The semantics on arbitrary data graphs will be described later.
Consider the following two structural recursions:

$f_1(t_1 \cup t_2) = f_1(t_1) \cup f_1(t_2)$      $f_2(t_1 \cup t_2) = f_2(t_1) \cup f_2(t_2)$

$f_1(\{owner : t\}) = \{owner : f_2(t)\}$         $f_2(\{l : t\}) = \{l : f_2(t)\}$

$f_1(\{l : t\}) = f_1(t)$                         $f_2(\{\}) = \{\}$

$f_1(\{\}) = \{\}$.

These structural recursions copy the subgraphs under an owner edge of the data
graph in Figure 1. The semantics of structural recursion is the following: as it
reaches a node which has several outgoing edges, it calls itself recursively on
each branch, and it takes the union of the results,
$f_1(t_1 \cup t_2) = f_1(t_1) \cup f_1(t_2)$ (there are several ways to split
the graph as $t_1 \cup t_2$, however, all choices lead to the same result). It
does the construction eagerly as it processes an edge with a label followed by a
subgraph, $f_2(\{l : t\}) = \{l : f_2(t)\}$ (here $l$ matches every label). In
the example it can be observed that recursive calls to other structural
recursions are allowed. Some syntactic restrictions were introduced in order to
guarantee termination [7]. These are the following: (i) the right side of
$f(t_1 \cup t_2)$ should always be $f(t_1) \cup f(t_2)$ (ii) each result of
recursive calls on the right side of $f(\{l : t\})$ can be used only by
constructors, calls like $f_1(f_2(t))$, $f(\{a : \{b : t\}\})$ are not allowed
(iii) the result of $f(\{\})$ should always be $\{\}$, recall that $\{\}$
denotes the empty graph. Since the $f(t_1 \cup t_2)$, $f(\{\})$ rows are always
the same, we shall write down only the remaining parts of structural recursions
in the sequel. Note that the order of rows is important, because for example an
owner edge of the input in $f_1$ matches both $\{owner : t\}$ and $\{l : t\}$.
In such cases the first matching row counts.

Algorithm: Construction of the Operational Graph
Input  : f = (f1, ..., fn)
Output : U_f
Method : used := {}; default := ⊥;   /* ⊥ denotes the unsatisfiable predicate */
         /* assume f1 has the control first */
         create a new node, label it with f1;
         used := used ∪ {f1}; builder(next_row(f1), f1);

builder(label1, label2, control, current):
  if label1 ≠ null then
    if control has not been processed in the course of the algorithm then
      create a new node labeled control;
    if label1 = l then label1 := default;
    else default := default ∧ ¬label1(x);
    draw an edge from current to control, label it with label1 and label2;
    if control not in used then
      used := used ∪ {control};
      builder(next_row(control), control);
    if current ≠ control then builder(next_row(current), current);
  else return(0);

Fig. 4. The algorithm constructing the operational graph of a structural recursion.
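
The two structural recursions $f_1$ and $f_2$ of the example can be sketched directly in Python on trees. The encoding of a tree as a frozenset of (label, subtree) pairs is an assumption of this illustration; the loop over the pairs plays the role of the $f(t_1 \cup t_2)$ row, and the `if`/`else` order reflects that the first matching row counts.

```python
def f2(tree):
    # f2({l: t}) = {l: f2(t)}: copy every edge together with its subtree.
    return frozenset((label, f2(sub)) for label, sub in tree)

def f1(tree):
    result = frozenset()            # f1({}) = {}
    for label, sub in tree:         # f1(t1 U t2) = f1(t1) U f1(t2)
        if label == "owner":        # f1({owner: t}) = {owner: f2(t)}
            result |= frozenset({("owner", f2(sub))})
        else:                       # f1({l: t}) = f1(t)
            result |= f1(sub)
    return result
```

Applied to a tree with a restaurant edge above an owner branch and an address branch, `f1` returns only the owner branch with everything below it.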

3 The Track Semantics of Structural Recursions 

In this section we shall develop a new way for defining the semantics of
structural recursions on arbitrary inputs. First we shall construct an auxiliary
graph, the operational graph $U_f$ for a structural recursion
$f = (f_1, \ldots, f_n)$. Here $f$ consists of $n$ structural recursions calling
each other recursively. By means of its operational graph one will be able to
track the run of a structural recursion. In addition it tells which subgraphs of
the input may have an impact on the result. It also "describes" the structure of
the output, and it will establish the connection between schema graphs and
structural recursions. The algorithm constructing the operational graph can be
found in Figure 4.

The nodes of $U_f$ are labeled with the $f_i$ ($1 \le i \le n$) and its edges
are labeled with two-element lists of constants or predicates. If $f_i$
constructs a $b$ edge as a result of an $a$ edge and calls $f_j$ for the
subgraph under the $a$ edge, then there is an $(f_i, [a, b], f_j)$ edge in
$U_f$. The next_row($f_i$) ($1 \le i \le n$) function takes the row following
the last processed row of $f_i$. If none of its rows has been processed yet,
then it takes the first row. It returns three values: the label in the singleton
set on the left hand side of the equation, and the label and the name of the
structural recursion in the singleton set on the right hand side respectively.
If there is no other row to process, then it returns three nulls. If the label
on the right side is missing, i.e., nothing will be constructed, then the second
value becomes $\varepsilon$. These values are given to the variables label1,
label2, control in the argument of the function builder. The fourth parameter
tells which structural recursion has the control now. In effect the function
builder constructs the operational graph by calling itself recursively. Note
that the construction can be done in $O(n)$ time in the size of the input (here
the size of the input is the number of rows in $f$). As an example, consider the
following structural recursions:



/i : ({a: i})={a: f2{t)} 
{{b-.t}) = {b-.Mt)} 
{{i-.t}) = Mt) 

f2 ■■ ({a: t})=fi{t) 
{{l-.t})={l-.Mt)} 



h ■■ ({c: 0)={a: 

{{d-.t})={} 

u ■ {{d: t})={e: U{t)} 



$U_f$, $f = (f_1, \ldots, f_4)$, can be found in Figure 5.a.
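
The construction can also be sketched in Python. The code below is an iterative simplification of the recursive builder of Figure 4, not a transcription of it. It is applied to the owner-copying recursions $f_1$, $f_2$ of Section 2 rather than to the example above. The row encoding `(pattern, output_label, callee)`, the markers `EPS` and `ANY`, the predicate encoding as tagged tuples, and the rule that an output label $l$ is replaced by the predicate of its row are all assumptions of this illustration.

```python
EPS = "eps"   # nothing is constructed (hypothetical marker)
ANY = "*"     # the default pattern l (hypothetical marker)

def operational_graph(rules, start):
    """rules: {name: [(pattern, output_label, callee), ...]} in row order."""
    edges, seen, stack = [], {start}, [start]
    while stack:
        f = stack.pop()
        earlier = []                      # labels matched by earlier rows of f
        for pattern, out, callee in rules[f]:
            if pattern == ANY:
                # default label: none of the earlier labels applies
                pred = ("not_in", frozenset(earlier))
            else:
                pred = ("eq", pattern)
                earlier.append(pattern)
            if out == ANY:                # the row copies the matched label
                out = pred
            if callee is None:            # no recursive call: no edge is drawn
                continue
            edges.append((f, (pred, out), callee))
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return edges

rules = {
    "f1": [("owner", "owner", "f2"), (ANY, EPS, "f1")],
    "f2": [(ANY, ANY, "f2")],
}
U_f = operational_graph(rules, "f1")
```

Each row is visited once, so the work is linear in the number of rows, as stated in the text.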

Now we are to define the semantics by means of operational graphs. The
processing of an input consists of three steps. Suppose that we are to process
the subgraph $t$, which is assumed to be a tree at first. (i) Split $t$ into
branches, process each of them and then take the union of the results. (ii)
Assume that we are to process the $(u, a_1, v)$ edge of $t$ according to $f_i$.
Because of the construction of the default label, there is a unique edge from
$f_i$, $(f_i, [p, b_1], f_j)$, s.t. $\mathcal{U} \models p(a_1)$. The result of
$(u, a_1, v)$ will be an edge $(u', b_1, v')$ and $f_j$ will carry on the
processing. Remember that an $\varepsilon$ edge means that nothing will be
constructed. Assume that next we are to process $(v, a_2, w)$ and its result is
$(v'', b_2, w')$. Then add an $\varepsilon$ edge from $v'$ to $v''$ in the
result. Note that $u'$ should also be linked to the endpoint of the result of
the edge processed right before $(u, a_1, v)$. (iii) If $t$ has cycles, then for
a given cycle assume that $(u', b_1, v')$ was constructed as a result of its
first processed edge, and $(w', b_2, z')$ as a result of its last processed one.
Then add an $\varepsilon$ edge from $z'$ to $u'$. In this paper we consider
looping edges as cycles. Finally the $\varepsilon$ edges can be eliminated. An
example can be found in Figure 5.b, which uses the operational graph of the
previous example.
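
Steps (i) and (ii) can be sketched for tree inputs as follows. The sketch is restricted to trees, so step (iii) is not covered, and it splices the result of an $\varepsilon$ edge directly into the output instead of adding and later eliminating $\varepsilon$ edges. The operational graph of the owner-copying recursions from Section 2 is written out by hand; its encoding (tagged predicate tuples, a predicate as output label meaning that the input label is copied) is an assumption of the example.

```python
EPS = "eps"  # hypothetical marker for the epsilon label

# Hand-written operational graph: (source, (predicate, output), target).
OP_EDGES = [
    ("f1", (("eq", "owner"), "owner"), "f2"),
    ("f1", (("not_in", frozenset({"owner"})), EPS), "f1"),
    ("f2", (("not_in", frozenset()), ("not_in", frozenset())), "f2"),
]

def holds(pred, label):
    kind, arg = pred
    return label == arg if kind == "eq" else label not in arg

def run(op_edges, f, tree):
    """Process a tree (frozenset of (label, subtree) pairs) starting in f."""
    result = frozenset()
    for label, sub in tree:                       # step (i): branch by branch
        for src, (pred, out), callee in op_edges:
            if src == f and holds(pred, label):   # step (ii): the matching edge
                below = run(op_edges, callee, sub)
                if out == EPS:                    # nothing is constructed
                    result |= below
                else:
                    new_label = label if isinstance(out, tuple) else out
                    result |= frozenset({(new_label, below)})
                break
    return result
```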

We shall call this method the track semantics of structural recursions. Its
equivalence with the semantics defined in [7] is quite natural.



4 Static Analysis and Optimizing Methods 

4.1 Preparing Concepts and Observations 

For answering static analytical questions, first we shall need some concepts.
Consider only the first elements of edge labels in $U_f$; this graph will be
$D_f$. Similarly, the second elements will give $G_f$. Note that $D_f$ is
deterministic, i.e., $\nexists a \in \mathcal{U}$ s.t.
$\mathcal{U} \models p_1(a) \wedge p_2(a)$, where $p_1$, $p_2$ are labels of two
neighbouring edges in $D_f$. Next note that if $S_2$ subsumes $S_1$ (possibly
dually), $S_1 \preceq S_2$, then each node and each edge of $S_1$ have at least
one pair in $S_2$ given by the $\preceq$ relation. The subgraph constituted by
these nodes and edges is called the correspondence of $S_1$ in $S_2$, in
notation $S_2^{S_1}$. Formally
$V.S_2^{S_1} := \{u_2 \mid u_2 \in V.S_2, \exists u_1 \in V.S_1, u_1 \preceq u_2\}$
and
$E.S_2^{S_1} := \{(u_2, p_2, v_2) \mid (u_2, p_2, v_2) \in E.S_2, \exists (u_1, p_1, v_1) \in E.S_1, u_1 \preceq u_2, v_1 \preceq v_2, \exists a \in \mathcal{U}, \mathcal{U} \models p_1(a) \wedge p_2(a)\}$.
The $\varepsilon$-extension of a subgraph $\bar{S}$ of a schema graph $S$, in
notation $\bar{S}^{\varepsilon}$, is the following: (i)
$V.\bar{S}^{\varepsilon} := \{v \mid v \in V.\bar{S}$ or
$\exists (u, \varepsilon^*, v) \in E.S$ s.t. $u \in V.\bar{S}\}$ (ii)
$E.\bar{S}^{\varepsilon} := \{(u, c, v) \mid (u, c, v) \in E.\bar{S}$ or
$c = \varepsilon$ and $u, v \in V.\bar{S}^{\varepsilon}\}$.

Fig. 5. (a) an operational graph (b) an example of the track semantics

Next we make some observations, which will turn out to be useful in the
sequel. First note that for an arbitrary edge of $U_f$, there exists an instance
$I$, s.t. this edge is traversed in the processing of $I$. Namely, take a path
in $D_f$ ending in the corresponding edge. Leave out the node labels, and change
the formulas to constants satisfying them. Let $I$ be this path. It is obvious
that $I$ has the required property. Similarly, we get a lemma as follows:

Lemma 1 For an arbitrary rooted subgraph of $U_f$ with the same root as $U_f$,
there exists an instance $I$, s.t. this subgraph is traversed in the processing of $I$.

Note that the question whether, for a given structural recursion $f$, there
exists an instance $I$ s.t. $f(I)$ is not empty can be answered easily by means
of Lemma 1. Namely, the following statement holds: there is an instance $I$ for
a given structural recursion $f$ s.t. $f(I)$ is not empty iff $G_f$ contains at
least one edge having a label different from $\varepsilon$. Moreover we have
another lemma:

Lemma 2 For a given instance $I$ and structural recursion
$f = (f_1, \ldots, f_n)$, $(u_I, p_I, v_I) \in E.I$ will be processed by the
structural recursion $f_i$, and $f_j$ will get the control afterwards iff
$((u_I, u_{D_f}), p_I \wedge p_{D_f}, (v_I, v_{D_f})) \in E.I \sqcap D_f$. Here
$f_i$, $f_j$ are the node labels of $u_{D_f}$ and $v_{D_f}$ respectively.

Proof. Assume that the first edge of $I$ which is to be processed is
$(u_I, a, v_I)$. If
$((u_I, u_{D_f}), p_a \wedge p_{D_f}, (v_I, v_{D_f})) \in E.I \sqcap D_f$, then
$\mathcal{U} \models p_{D_f}(a)$. Hence $(u_I, a, v_I)$ will be processed
according to $(u_{D_f}, p_{D_f}, v_{D_f})$. Conversely, if $(u_I, a, v_I)$ is
processed according to $(u_{D_f}, p_{D_f}, v_{D_f})$, then
$((u_I, u_{D_f}), p_a \wedge p_{D_f}, (v_I, v_{D_f}))$ is in $E.I \sqcap D_f$.
Note that this edge in $U_f$ is unambiguous, since $D_f$ is deterministic. The
proof can be continued in a similar manner. Note that each node of $I$ has at
most one pair in $V.I \sqcap D_f$ and hence in $D_f$.
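
The non-emptiness criterion stated before Lemma 2 amounts to a reachability test on $G_f$. The following Python sketch assumes that $G_f$ is given as a list of `(source, output_label, target)` edges and that `EPS` marks the $\varepsilon$ label; both are choices made for this example. The function scans the edge list once per visited node for brevity; with adjacency lists the same test runs in linear time.

```python
EPS = "eps"  # hypothetical marker for the epsilon label

def may_produce_output(root, g_edges):
    """True iff some edge reachable from root has a label other than EPS."""
    reachable, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for src, label, dst in g_edges:
            if src == u:
                if label != EPS:
                    return True
                if dst not in reachable:
                    reachable.add(dst)
                    stack.append(dst)
    return False
```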

At the end of this subsection we prove four easy technical lemmas. 

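Lemma 3 Let $S_1$, $S_2$ be schema graphs and $I$ an instance s.t.
$S_1 \preceq_d I \preceq S_2$, then $S_1 \preceq_d S_2$.



Proof. We construct a relation $\preceq_d$ between the two schema graphs in the
following manner. First $v_{01} \preceq_d v_{02}$, where $v_{01}$, $v_{02}$ are
the roots. Suppose that $u_1 \preceq_d u_2$, $u_1 \in V.S_1$, $u_2 \in V.S_2$
and $\exists u_I \in V.I$, $u_1 \preceq_d u_I$, $u_I \preceq u_2$; the three
roots meet this requirement. Consider $(u_1, p_1, v_1) \in E.S_1$. Then
$\exists (u_I, a, v_I) \in E.I$ s.t. $v_1 \preceq_d v_I$ and
$\mathcal{U} \models p_1(a)$. Moreover $\exists (u_2, p_2, v_2) \in E.S_2$ s.t.
$v_I \preceq v_2$ and $\mathcal{U} \models p_2(a)$. Thus the relation can be
extended with $v_1 \preceq_d v_2$. The proof can be continued inductively.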



Lemma 4 Let $S_1$, $S_2$, $S_3$ be schema graphs. If $S_1 \preceq S_2$ and
$S_2 \preceq S_3$, then $S_1 \preceq S_3$.



Proof. Recall that $S_1 \preceq S_2$ iff $[S_1] \subseteq [S_2]$ [9]. Owing to
the transitivity of the $\subseteq$ relation, the $\preceq$ relation is also
transitive.



Lemma 5 Let $S_1$, $S_2$ be schema graphs. Suppose that $S_1 \preceq S_2$. Then
$S_1$ is equivalent to $S_1 \sqcap S_2$.

Proof. From property (i) of the intersection $S_1 \sqcap S_2 \preceq S_1$ holds.
Furthermore $S_1 \preceq S_1$ trivially and $S_1 \preceq S_2$ was supposed,
hence from property (ii) of the intersection $S_1 \preceq S_1 \sqcap S_2$.



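Lemma 6 Let $S_1$, $S_2$, $S_3$ be schema graphs. Then
$(S_1 \sqcap S_2) \sqcap S_3$ is equivalent to $S_1 \sqcap (S_2 \sqcap S_3)$.

Proof. Applying the definition of intersection we get the following:
$V.(S_1 \sqcap S_2) \sqcap S_3 = \{((u_1, u_2), u_3) \mid (u_1, u_2) \in V.S_1 \sqcap S_2, u_3 \in V.S_3\}$,
$E.(S_1 \sqcap S_2) \sqcap S_3 = \{(((u_1, u_2), u_3), ((p_1 \wedge p_2) \wedge p_3), ((v_1, v_2), v_3)) \mid ((u_1, u_2), u_3), ((v_1, v_2), v_3) \in V.(S_1 \sqcap S_2) \sqcap S_3$,
$((u_1, u_2), p_1 \wedge p_2, (v_1, v_2)) \in E.S_1 \sqcap S_2$,
$(u_3, p_3, v_3) \in E.S_3$ and
$\exists a \in \mathcal{U}, \mathcal{U} \models (p_1(a) \wedge p_2(a)) \wedge p_3(a)\}$.

$V.S_1 \sqcap (S_2 \sqcap S_3) = \{(u_1, (u_2, u_3)) \mid (u_2, u_3) \in V.S_2 \sqcap S_3, u_1 \in V.S_1\}$,
$E.S_1 \sqcap (S_2 \sqcap S_3) = \{((u_1, (u_2, u_3)), (p_1 \wedge (p_2 \wedge p_3)), (v_1, (v_2, v_3))) \mid (u_1, (u_2, u_3)), (v_1, (v_2, v_3)) \in V.S_1 \sqcap (S_2 \sqcap S_3)$,
$((u_2, u_3), p_2 \wedge p_3, (v_2, v_3)) \in E.S_2 \sqcap S_3$,
$(u_1, p_1, v_1) \in E.S_1$ and
$\exists a \in \mathcal{U}, \mathcal{U} \models p_1(a) \wedge (p_2(a) \wedge p_3(a))\}$.
It can be seen that, up to the bracketing of the pairs,
$V.(S_1 \sqcap S_2) \sqcap S_3 = V.S_1 \sqcap (S_2 \sqcap S_3)$ and
$E.(S_1 \sqcap S_2) \sqcap S_3 = E.S_1 \sqcap (S_2 \sqcap S_3)$.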
4.2 Practical Problems and Their Solutions

The preceding lemmas have some corollaries, which may play an important role
in real world situations. Firstly suppose that one would like to insert some data
from a database into another one in the well-known insert into-select-from-where
fashion, but here an UnQL query is used instead of an SQL query. Assume that
this query can be translated into a structural recursion $f$. It is reasonable
to assume that the target database has some restrictions, which we can translate
to our "jargon" as follows: there exist two schema graphs $S_1$, $S_2$ s.t.
$S_1 \preceq_d f(I) \preceq S_2$ should hold for every instance $I$. Note that
in case of XML documents there may be several properties prescribed by DTDs that
can be expressed using schema graphs in such a manner. Hence one may also apply
the following discussion to that case.

Secondly suppose that the inputs are typed, i.e., all of them should conform 
to a given schema graph S. This can be the situation, when we are to process 
results of another query, for example. It can also be a natural question that for 
a given structural recursion $f$, whether there exists a typed input s.t. the result
is not empty. These are both satisfiability questions and as we solve them, we 
shall also develop some methods for checking these restrictions as well as for 
optimizing query evaluation. 

For the first problem remember that in the proof of Lemma 2. we have 
observed that each node of the input has at most one pair in Df as we process 
it. Note that these are given by . Since both Df, Gf are derived from Uf, 

it is obvious that each node and edge of Df have a corresponding node and edge 
in Gf. For a given subgraph of Df, D'f, call the corresponding subgraph as the 
G f -correspondence of D'f. Note that the Df -correspondence of a subgraph of Gf 
could be defined similarly as well as the U f -correspondence. Moreover we can 
say that each node of the input has a corresponding node in Gf. In the course 
of processing do the following: whenever an e edge is drawn as a consequence 
of a cycle in /, draw an edge between the appropriate nodes in Gf. Denote 
this new graph Gy. Clearly, from the track semantics /(/) Gy holds. The 

relation are constituted by means of the aforementioned pairs of nodes of / 
and Gy respectively. Thereafter for a given node m in Gy, add e edges from the 
nodes reachable from u to u. If it has not been done, then add also a looping e 
edge to u. Denote this graph G/,e. Eliminate the £ edges and denote the result 
graph Gy. Obviously for every instance I, Gf Gf. Consequently by Lemma 
4. /(/) Gy holds. Now, we get a nearly straightforward proposition. 

Proposition 1 For a given structural recursion f and schema graph S there 
exists an instance I s.t. f{I) S holds iff S Gf- 

Proof. If there is an instance I s.t. S =4~ f{I), then by Lemma 3., S Gy. 

If S Gf, then consider Gy. It can be seen that S S □ Gy. Namely let 
{u,p, v) be an edge of S. Here u denotes the root. We know that there is an edge 
{u',p',v') in Gy s.t. V =4~ v' and 3a, U N p{a) Ap'(a). Here u' denotes the root. 
Then {{u,u'),pAp' , (v,v')) will be an edge of SflGf. Moreover u {u,u') and 
V = 4 ~ {v,v') hold. The relation between S and S' □ Gy can be constructed 
further along the same lines. We also know that SnGf S holds. Hence S and
SriGf have the same structure, so G^ and are equivalent to each other. 

Consider the ^/-correspondence of . By Lemma 1., there is an instance 

/ s.t. this graph is traversed as we process I. From the definition of the track 
semantics S f{I) holds clearly. Note that S' □ G/,e would not have a sense, 
because the intersection was not defined for schema graphs having e edges. 

The proof also gives us a method for checking whether the output of a
given input will satisfy the restriction in question or not. So we would like to
prescribe that $f(I)$ should conform dually to a schema graph $S$, i.e., on each level

the required edges are given. It has become clear, that only is important 

from this point of view, i.e., its edges should be used in the construction. Hence 
clearly should be equivalent with S, otherwise the restriction could not 

be satisfied. Note that by means of G/,e one can rewrite the s edges in , 

hence we get a subgraph, , of G/,e- Denote its ^/-correspondence Df _■ 

Obviously Dj_ should be traversed for the satisfaction of the restriction. Hence 
D/ _ should be a subgraph of . This is a necessary and sufficient condition, 

if is equivalent with S. We have just proven the necessity. However, the 

sufficiency is also straightforward from the preceding train of thoughts. 

On the other hand, if one would like to prescribe that /(/) should conform 

to a schema graph S, f{I) S, then by similar train of thoughts Gj ^ 
should be considered again. Remember that here on each level the allowable 

edges are given. Consequently edges of G/ not in Gp^^ should not be used 
for constructions. Certainly e edges do not have an impact on the result. Hence 

the e-extension of Gpp^ should be considered. Denote its ^/-correspondence 
As it can be seen only the edges of Dj can be used in the processing, 

i.e., should be equivalent to I. This a necessary and sufficient condition 

again. Moreover instead of In Df, /□ Dj^ should be taken into account in the 
construction of the result. Thus it is enough to use the [//-correspondence of Dj^ 
instead of [//. This could mean considerable improvement in query evaluation. 
In the next paragraph a similar result will be presented, whose effect was tested 
empirically in [14]. 

The beauty of the previous considerations is that, if we return to our second
practical problem, when typed inputs are considered, the solution can be given
along the same lines. Accordingly $I$ conforms to a schema graph $S$,
$I \preceq S$, and by Lemma 2 only the correspondence of $I$ in $D_f$ is used
for the construction of the result. By Lemma 4 $I \sqcap D_f \preceq S$ also
holds. Now as we apply Lemma 5 and Lemma 6 we get that $I \sqcap D_f$ is
equivalent to $I \sqcap D_f \sqcap S$. Hence in the processing of an instance
only the appropriate subgraph of $S \sqcap D_f$ should be considered instead of
$D_f$. This result is a slight generalization of another one given in [14],
because regular path expressions can be translated into structural recursions.
As it was mentioned, this fact may again bring considerable improvements when a
query is to be executed [14].
Fig. 6. Examples of a dual schema (a) and a schema graph (b) for query rewriting.



5 Query Rewriting 

Consider the first practical problem of Section 4.2 again, when the structure of
the output was restricted by means of schema graphs, i.e., keeping the notations,
$S_1 \preceq_d f(I) \preceq S_2$ should hold. Suppose that the user was careless
and has written a query whose results cannot satisfy the restrictions because
of the structure of the query. There are two possibilities. Either the database
system displays an error message, or the query is rewritten in such a way that
the restrictions are involved in the query, as in case of the chase algorithm on
relational tableaux [3]. This latter option can be achieved by means of the
results of the previous section.

Recall that the equivalence of and S\ was a necessary and sufficient 

condition for the satisfiability of the given restriction. Furthermore only the 
edges of _|_ should be considered in the processing. These two subgraphs and 
the schema graphs show how the syntax of the original structural recursion
should be changed. We present the method through an example. From this it is
not difficult to develop a general algorithm. Consider the data graph in Figure
1. $S_1$, $S_2$ are given in Figure 6.a and b. For the sake of simplicity suppose
that the following query was asked.

$f(\{l : t\}) = \{l : f(t)\}$

The rewritten query will be the following. 

$f$ : $(\{owner : t\}) = \{owner : g_2(t)\} \cup \{owner : g_1(t)\}$      $g_1$ : $(\{l : t\}) = \{l : h(t)\}$

      $(\{tel : t\}) = \{tel : g_1(t)\}$                                  $g_2$ : $(\{l : t\}) = \{l : g_1(t)\}$

      $(\{price : t\}) = \{price : g_1(t)\}$                              $h$ : $(\{l : t\}) = \{\}$

      $(\{l : t\}) = \{\}$

It can be seen that for the rewritten query there certainly exist instances
whose results satisfy the restrictions. Note that we cannot check whether an edge
after a tel edge has an integer label. In this case we may assume that this is
guaranteed by the source. Alternatively, structural recursions can be
extended easily to be able to check such prescriptions. Namely, conditions like
$\{Int(l) : t\}$ should be allowed on the left hand side, for example.
Conclusions

We have developed a new way for defining semantics for structural recursions 
by means of which the relationships between structural recursions and schema 
graphs can be apprehended. We have also given some methods for answering 
satisfiability questions, checking restrictions and optimizing queries. As it is usual 
after solving the satisfiability problems, questions of containment can also be 
solved. 

We have not discussed those cases, when the right sides of the equa- 
tions of structural recursions consist of a more complex structure, for instance 
{a: U {b: f 2 {t)} or {a: {6: than a singleton set. However, note 

that the track semantics can be extended without difficulties to handle such 
cases and the result can also be constructed in polynomial time. 

We have not yet examined in depth the relationships between schema graphs 
and DTDs, or how our methods should be changed in the presence of DTDs. 
Perhaps this will be done in the future. Structural recursions with conditions have 
also not been considered here, but we plan to analyse them in another paper. 
Abstract. If vast quantities of data elements are considered, catalogues 
provide an intuitive way of organisation. While in common use the term 
catalogue refers to a tree-shaped specialisation hierarchy, we allow any 
transitively reduced acyclic digraph of a transitive relation such as the 
specialisation relationship to be a representation of a catalogue. This 
conforms to real-life scenarios, where each element can be classified 
differently depending on the actual point of view. As shown by the 
application of catalogues for database integration purposes, the inherent 
definition of similarity among categories is extremely useful. In this 
paper, we investigate whether catalogues as a physical organisation 
method also have some benefit. We accomplish this task by precisely 
defining the data structure and thoroughly analysing the time and space 
complexity properties of its management routines. The results are also 
compared to those of relevant alternative organisation methods. 

Keywords: Physical organisation, catalogue, similarity query 



1 Introduction 

Efficiency in cost and time is a major issue in the development of computer 
systems. Efficiency is also key to the business success of the customers of such 
systems. As a consequence, solution providers are forced to deliver software for 
constantly evolving requirements. This phenomenon hinders literal software 
re-use, but makes vendors produce systems which utilise already existing 
solutions as subcomponents. 

The previous statement also applies to information systems which manage 
massive amounts of data. Their common property is that although the underlying 
Database Management Systems (DBMSs) support either navigation or 
value-based search to query and update data elements, the business logic needs 
more sophisticated retrieval and management functions. Therefore, these must 
be implemented outside the DBMS, contrary to the original concept of data 
management. To illustrate this, consider the following simple scenario, which is 
rather typical in any data retrieval or analysis environment. 

Example 1 (Travel agency). Diverse information about holiday resorts is stored 
at a travel agency. For customers, probably the most important details are daily 
price, accessibility and facilities. On demand the best matching places should 
be returned for a given preference list. If the potential number of facilities and 
accessibilities is large, there is a great chance of not finding any resort which 
exactly matches the specification. Then approximate (quasi-similar) hits are to 
be delivered. While in the case of price the definition of similarity is 
straightforward, for accessibility and facilities there is no obvious interpretation of similarity. 
The most likely definition for this purpose is the inherent set inclusion. However, 
the usage of such a proximity function is not supported by any DBMS directly. 

Database management systems have been designed to provide efficient re- 
trieval and update of data elements based on the values of fields and inter-field 
relationships. But for most applications, relationships within single fields 
(intra-field) are of great importance nowadays. For instance, considering the travel 
agency again, set inclusion of facilities or accessibility is such a relation. Such 
relations are necessary for the evaluation of similarity queries and other queries 
which specify minimum and/or maximum conditions to be satisfied. 

Although there is no hurdle to creating new data models which are capable of 
representing intra-field relationships, without proper counterparts in the 
physical layer such extensions cannot be exploited. Unfortunately, no physical 
organisation technique provides them currently. The commonly used methods either 
focus on inter-field relationships and create clusters [1], simplify the calculation 
of joined data elements [2], or they enforce predefined intra-field relationships 
(indexes [3]), or they fully neglect them (hashes [3]). This means that current 
physical organisation methods are fairly sub-optimal for the purposes of modern 
applications with sophisticated models. It should be noted, however, that sup- 
porting intra-field relationship-aware data models by the physical layer does not 
necessarily contradict the principle of (physical) data independence [4] since phys- 
ical organisation particulars can be changed without altering the data model. 

In this paper, we investigate if the use of so-called catalogues as a physical 
organisation method is reasonable to provide access to intra-field relationships 
that are defined by pre orders. These catalogues were originally designed to 
integrate heterogeneous databases and are described in detail in [5]. To begin with, 
we briefly introduce catalogues in the next section. Since catalogues are mainly 
used to accelerate lookup-by-value operations, other physical organisation 
methods with similar functionality are also pointed out there. Section 3 deals with 
the space requirements of each method. Section 4 estimates the running time of 
the important operations. Section 5 lists several options to improve the 
performance of catalogues and concludes the paper. 
2 Catalogues and Quasi-alternative Methods 

Physical organisation methods can be classified into two groups based on their 
concrete goal. Some methods deal with the structuring of data elements so that 
dereferencing inter-entity links is inexpensive on average in terms of block 
accesses. The remaining methods build auxiliary structures in order to speed up 
other operations (again, measured in medium accesses), e.g. the lookup by value, 
insertion, deletion and modification of data elements. 

As catalogues aim at representing intra-field relationships, they belong to 
the second group. It means that the efficiency of catalogues must be compared 
to other methods managing auxiliary structures. Interestingly, according to [3] 
none of the basic structures has fundamentally changed since it was enumerated 
in [4]. Among them only hashes and sparse indexes are considered in this paper. 
Dense and several other, less known indexes are omitted because they are less 
frequently employed and the space and time complexity of their management 
routines are similar to those of sparse indexes. Bitmap indexes are relatively 
recent and were originally used in Decision Support Systems 
to accelerate data search based on low-cardinality fields [6,3]. Since we impose 
no limit on the size of a field’s domain, bitmap indexes are unsuitable for our 
purposes and are not included in our comparison either. 

There also exist more sophisticated structures for special purposes. For 
example, techniques for computing join, the key operation of the relational data 
model, have their own literature (see e.g. [2,7]). But these methods are usually 
add-ons in the sense that they are not normally used in basic lookup operations. 
Consequently, we do not deal with them in this paper, but we note that their 
interaction with catalogues must clearly be addressed in the future. 

Based on [5], we define catalogues for physical organisation purposes as fol- 
lows. 

Definition 1 (Catalogue). Let $\preceq$ be a given pre order on the universe of an 
attribute and $\le$ a partial order on the equivalence classes specified by $\preceq$. 
A catalogue is basically the digraph representation of $\le$ where each node is an 
equivalence class and reflexive and transitive links are omitted. Finally, each node of 
the catalogue, similarly to the leaves of sparse indexes, points to the set of data 
elements which contain any value assigned to the node in their corresponding 
attribute. 
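
The construction described in this definition can be sketched in Python as follows. This is only a naive illustration under stated assumptions, not the construction of [5]: the names `build_catalogue` and `leq` are hypothetical, the order is supplied as a two-argument predicate, and nodes are identified by their position in the returned list.

```python
def build_catalogue(values, leq):
    """Return (classes, edges) for the pre order given by the predicate leq.

    classes: list of equivalence classes (each a list of values).
    edges:   set of index pairs (i, j), meaning class i directly precedes class j.
    """
    # Group the values into equivalence classes: u ~ w iff u <= w and w <= u.
    classes = []
    for v in values:
        for cls in classes:
            if leq(v, cls[0]) and leq(cls[0], v):
                cls.append(v)
                break
        else:
            classes.append([v])

    reps = [cls[0] for cls in classes]
    n = len(reps)
    # Strict order between distinct classes (reflexive links are dropped here).
    below = [[i != j and leq(reps[i], reps[j]) for j in range(n)] for i in range(n)]
    # Keep i -> j only if no class k lies strictly between them (transitive links dropped).
    edges = {
        (i, j)
        for i in range(n)
        for j in range(n)
        if below[i][j] and not any(below[i][k] and below[k][j] for k in range(n))
    }
    return classes, edges


# Facility sets of the travel agency example, ordered by set inclusion.
facilities = [
    frozenset(),
    frozenset({"sauna"}),
    frozenset({"golf", "sea"}),
    frozenset({"golf", "tennis", "sauna", "sea"}),
    frozenset({"casino", "golf", "tennis", "sauna", "sea"}),
]
classes, edges = build_catalogue(facilities, frozenset.issubset)
```

The sketch takes cubic time in the number of classes, which is acceptable for an illustration but says nothing about how a production catalogue would be maintained incrementally.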



Example 2 (prev. contd.). For the travel agency scenario, several simple data 
elements and their corresponding catalogues determined by different orders are 
shown in Fig. 1. In the case of price the usual arithmetical order is employed, 
while for accessibility and facilities set inclusion is used as the order. Inter-node 
edges of the catalogues are depicted by → arrows. Entities are identified by the 
abbreviations of their names, and are connected to the nodes by ↦ arrows. 
Name                  Price   Accessibility     Facilities 
Student Hostel I.     30      {bus}             {} 
Student Hostel II.    30      {train}           {sauna} 
Transatlantic Motel   60      {bus}             {golf, sea} 
Seaside Hotel         800     {plane, train}    {golf, tennis, sauna, sea} 
Golden Beach          1000    {plane, ship}     {casino, golf, tennis, sauna, sea} 

a) Sample data elements 



b) Catalogue for price: 30 ↦ SH1, SH2;  60 ↦ TM;  800 ↦ SH;  1000 ↦ GB 

c) Catalogue for accessibility: {bus} ↦ SH1, TM;  {train} ↦ SH2;  {plane, train} ↦ SH;  {plane, ship} ↦ GB 

d) Catalogue for facilities: {} ↦ SH1;  {sauna} ↦ SH2;  {golf, sea} ↦ TM;  {golf, tennis, sauna, sea} ↦ SH;  {casino, golf, tennis, sauna, sea} ↦ GB 

Fig. 1. Sample data and catalogues for a travel agency 



3 Space Allocation 



In the rest of the paper let $k$ stand for the number of fields for which auxiliary 
structures can be built. Let $n_1, \ldots, n_k$ denote the cardinality of the 
$1^{st}, \ldots, k^{th}$ field respectively. The number of all data elements equals 
$N$ ($n_1, \ldots, n_k \le N$). 
Furthermore, we adopt two commonly used symbols to specify computational 
complexity. As usual, $O(X)$ means that the complexity is at most $c_1 X$, while 
$\Theta(X)$ means that the complexity lies between $c_2 X$ and $c_3 X$, where 
$c_1, c_2, c_3$ are constants. 

If hashing is used as a speed-up method, the whole structure occupies $\Theta(N)$ 
space irrespective of the cardinality of fields. 

If indexes are employed, the size (the number of nodes) of each index tree is 
proportional to $n_i$. This is a consequence of the fact that an index tree is always 
balanced and the number of pointers starting from a node is limited. Obviously, 
leaves have variable length as they store a reference to each data element which 
contains the value in question. But the total length of leaves is $\Theta(N)$. 

Although catalogues also have $n_i$ nodes, there is a substantial difference: they 
may contain $\Theta(n_i^2 + N)$ pointers in total, as shown next. 

Example 3 (‘Dense’ catalogue). Consider the sets of $\lfloor n_i/2 \rfloor$ and 
$\lceil n_i/2 \rceil$ nodes where each node in the former is connected to each node 
in the latter and the rest of the pointers determine which data elements contain each value. 
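
To make the quadratic growth concrete, here is a minimal Python sketch that enumerates the inter-node edges of such a two-layer catalogue. It assumes that the two node sets split the nodes roughly in half; the helper name `dense_catalogue_edges` is hypothetical.

```python
def dense_catalogue_edges(n):
    """Inter-node edges of a two-layer catalogue with n nodes (illustrative only)."""
    lower = range(n // 2)       # floor(n/2) nodes without predecessors
    upper = range(n // 2, n)    # ceil(n/2) nodes, each a successor of every lower node
    # No edge is redundant: with only two layers there are no transitive links to omit.
    return [(u, w) for u in lower for w in upper]


edge_counts = {n: len(dense_catalogue_edges(n)) for n in (7, 10, 100)}
```

The number of edges is the product of the two layer sizes, i.e. about a quarter of the square of the node count.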
The quadratic limit signifies that a catalogue can occupy much more space 
than an index. Example 1 demonstrates, however, that the subset relation plays 
a distinguished and important role in simple catalogues, i.e. in catalogues built 
on a single attribute. The reason for that is that the subset relation is a natural 
pre order on the power set of the values if no other pre order is identifiable on the 
domain. Hence, probably the most common catalogues are based on the subset 
relation, and the following calculation gives a favourable upper limit on their 
space allocation. 

Example 4 (Catalogue of all subsets). The catalogue of all possible subsets of 
the set containing $l$ elements has $L = 2^l$ nodes and 
$l \cdot 2^{l-1} = \frac{1}{2} L \log_2 L$ inter-node edges, i.e. its space 
requirement is linearithmic. 

4 Time Complexity 

We need an additional parameter for the time analyses. Directly or indirectly, all 
operations on physical structures are given a value $v_i$ for the $i^{th}$ field, for which 
the structure in question is built. The number of data elements storing the same 
value as specified by the argument is denoted by $N_i$ ($N_i \le N$). $M_i$ stands for 
the size of the result set if it does not equal $N_i$. 

4.1 Lookup 

As is well known, the hash-based search method locates the desired data elements 
in a single step (assuming a ‘good’ hash function); then only the dereferencing 
of pointers is needed, which takes $\Theta(N_i)$ time. 

Indexes are more robust than hashes in the sense that they do not depend on 
non-data parameters such as the selection of a function, but in theory they 
cannot operate as fast as hashes: a lookup always consumes $\Theta(\log n_i + N_i)$ time. 

Before analysing catalogues, we should exactly define the lookup method. 
For this, the definition of a new concept, a relation and the introduction of a 
notation are necessary. 

Definition 2 (Entry point). A node which has no predecessor is considered 
an entry point. 

Parts b, c and d of Fig. 1 illustrate catalogues with 1, 3 and 1 entry point 
respectively. 

Definition 3 (Value equivalence). $u$ is equivalent to $w$, denoted by $u \equiv_i w$, if 
and only if $u \preceq_i w$ and $w \preceq_i u$. 
It should be noted that although in the examples of Fig. 1 value equivalence 
is the same as value identity, they are distinct notions in general. 

Originally, both $\preceq_i$ and $\equiv_i$ are relations over values. From now on, we apply 
them to nodes as well: a node is treated as if any value of the node were given as 
the argument. This definition is unambiguous because of the properties of the 
relations and the way catalogues are constructed. 

The lookup procedure is specified as follows. Its correctness is proven in [8]. 

Algorithm 1 (Search in a Catalogue). 

simpleSearch($v_i$) - find all data elements containing $v_i$ in the $i^{th}$ field 

1. Let node be an unprocessed entry point, if any, otherwise go to step 5. 

2. result := simpleSearchNode(node, $v_i$) 

3. If result $\ne$ ‘not found’, return the data elements belonging to result. 

4. Go to step 1. 

5. Return ‘not found’. 

simpleSearchNode(node, $v_i$) 

1. If $v_i \equiv_i$ node, return node. 

2. Let nextNode be any of the direct successors of node for which 
nextNode $\preceq_i v_i$ holds. If there is no such nextNode, return ‘not found’. 

3. Call and return simpleSearchNode(nextNode, $v_i$). 

This implementation of lookup runs in $O(e_i + d_i + N_i)$ time, where $e_i$ is the 
number of entry points and $d_i$ is the length of the longest path in the 
catalogue. (Note that $e_i + d_i \le n_i$ always holds.) It means that, with respect to this 
operation, it is worth building a catalogue only if $e_i + d_i \ll n_i$. 
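
The lookup procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 rather than the authors' implementation: the `Node` class, the function names and the `leq` predicate are hypothetical, the recursion of the second routine is written as a loop, and `None` plays the role of ‘not found’.

```python
class Node:
    """A catalogue node: one value, its data elements and its direct successors."""

    def __init__(self, value, items=()):
        self.value = value
        self.items = list(items)
        self.successors = []


def simple_search_node(node, v, leq):
    """Descend from node towards v; return the node equivalent to v, or None."""
    while True:
        if leq(v, node.value) and leq(node.value, v):   # v is equivalent to node
            return node
        # Pick any direct successor that still precedes v.
        nxt = next((s for s in node.successors if leq(s.value, v)), None)
        if nxt is None:
            return None
        node = nxt


def simple_search(entry_points, v, leq):
    """Try every entry point in turn; return the matching data elements or None."""
    for entry in entry_points:
        found = simple_search_node(entry, v, leq)
        if found is not None:
            return found.items
    return None


# Accessibility catalogue of the travel agency example (set inclusion as order).
bus = Node(frozenset({"bus"}), ["SH1", "TM"])
train = Node(frozenset({"train"}), ["SH2"])
plane_train = Node(frozenset({"plane", "train"}), ["SH"])
plane_ship = Node(frozenset({"plane", "ship"}), ["GB"])
train.successors.append(plane_train)
entry_points = [bus, train, plane_ship]

hit = simple_search(entry_points, frozenset({"plane", "train"}), frozenset.issubset)
miss = simple_search(entry_points, frozenset({"ship"}), frozenset.issubset)
```

Each entry point contributes at most one downward walk, which is where the dependence on the number of entry points and on the longest path comes from.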

Example 5 (prev. contd.). The longest path in the catalogue of all subsets is $l$, 
and there is a sole entry point, the empty set. Therefore, $d_i = \log_2 L$ and the 
time complexity of search is the same as at an index. 
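
The figures quoted for the catalogue of all subsets can be checked on a small instance. The following Python sketch is only an illustration; the function names are hypothetical and the base set is assumed to be the integers 0 to l-1.

```python
from itertools import combinations


def subset_catalogue(l):
    """Nodes and reduced inclusion edges of the catalogue of all subsets of l elements."""
    elems = range(l)
    nodes = [frozenset(c) for r in range(l + 1) for c in combinations(elems, r)]
    # In the reduced graph a set points exactly to the sets with one more element.
    edges = [(s, s | {e}) for s in nodes for e in elems if e not in s]
    return nodes, edges


def longest_path(nodes, edges):
    """Length (in edges) of the longest directed path in an acyclic graph."""
    succ = {s: [] for s in nodes}
    for a, b in edges:
        succ[a].append(b)
    depth = {}

    def walk(s):
        if s not in depth:
            # A node without successors contributes a path of length 0.
            depth[s] = 1 + max((walk(t) for t in succ[s]), default=-1)
        return depth[s]

    return max(walk(s) for s in nodes)


nodes, edges = subset_catalogue(4)
entry_points = [s for s in nodes if not any(b == s for _, b in edges)]
```

For four elements this gives 16 nodes, 32 edges, a longest path of length 4 and the empty set as the only entry point.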

4.2 Insert, Delete 

Inserting a data element does not involve any non-constant time task besides 
lookup in the case of hashes. At indexes, all levels of the tree may have to be 
re-organised. Thus, insertion takes $\Theta(\log n_i + N_i)$ time. 

In catalogues re-organisation because of a new element is also necessary. The 
new node may reside in the middle of all other nodes, requiring the modifica- 
tion of all edges to maintain the reduced property of the graph. As an extreme 
case, insertion of such a new node into the catalogue outlined in Example 3 
runs in $\Theta(n_i^2)$, i.e. in quadratic time. The following implementation is not more 
time-consuming since it basically searches for directly preceding nodes and then 
modifies edges as needed. 
Algorithm 2 (Insertion into a Catalogue). 

insert($v_i$, item) - insert data element item which has $v_i$ as value of the $i^{th}$ field 

1. prevNodes := $\emptyset$ 

2. Let node be an unprocessed entry point, if any, otherwise go to step 7. 

3. result := predecessors(node, $v_i$, item) 

4. If result = ‘inserted’, exit procedure. 

5. prevNodes := prevNodes $\cup$ result 

6. Go to step 2. 

7. Create a new node newNode for $v_i$, and assign the single element item to 
it. 

8. Let prevNode be an unprocessed element of prevNodes, if any, otherwise 
exit procedure. 

9. Let nextNode be an unprocessed successor of prevNode, if any, otherwise 
go to step 13. 

10. If newNode $\preceq_i$ nextNode does not hold, go to step 9. 

11. Remove the edge between prevNode and nextNode. 

12. Create an edge from newNode to nextNode. 

13. Create edge from prevNode to newNode. 

14. Go to step 8. 

predecessors(node, $v_i$, item) 

1. If node $\preceq_i v_i$ does not hold, return $\emptyset$. 

2. If node $\equiv_i v_i$ is not true, jump to step 5. 

3. If item is not already assigned to node, add item to node. 

4. Return ‘inserted’. 

5. prevNodes := $\emptyset$ 

6. Let nextNode be an unprocessed successor of node, if any, otherwise return 
prevNodes. 

7. result := predecessors(nextNode, $v_i$, item) 

8. If result = ‘inserted’, return result. 

9. prevNodes := prevNodes $\cup$ result. 

10. Go to step 6. 

The deletion of a data element from a catalogue is conservative in the sense 
that if the element to be deleted was just inserted, the resulting catalogue is 
always the same as it was before the insertion. As a consequence, deletion has 
at least the same time complexity as insertion if catalogues are utilised. 

Algorithm 3 (Deletion from a Catalogue). 

delete($v_i$, item) - delete data element item which has $v_i$ as value of the $i^{th}$ field 

1. Let node be an unprocessed entry point, if any, otherwise go to step 5. 

2. result := simpleSearchNode(node, $v_i$) 

3. If result $\ne$ ‘not found’, go to step 6. 

4. Go to step 1. 

5. Return ‘not found’. 

6. If item is not assigned to result, return ‘not found’. 

7. If the only data element assigned to result is item, go to step 10. 

8. Remove item from the list of data elements assigned to result. 

9. Return ‘success’. 

10. Let prevNodes be the predecessor, nextNodes the successor nodes of result. 

11. Remove all edges starting from or ending at result. 

12. For each node $\in$ prevNodes, compute the set of nodes reachable from node 
by a directed path and assign it to succ[node]. 

13. Let (prevNode, nextNode) be an unprocessed pair from prevNodes $\times$ 
nextNodes, if any, otherwise return ‘success’. 

14. Create edge from prevNode to nextNode if nextNode $\notin$ succ[prevNode]. 

15. Go to step 13. 

The previous implementation indeed runs in $\Theta(n_i^2)$ time. Only Step 12 needs 
further attention, but it is easy to see that it can be executed for all nodes of 
the catalogue in one step by applying a graph-traversal algorithm. The running 
time of such algorithms is proportional to the number of vertices and edges [9], 
hence it is at most quadratic in the number of nodes. 
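
The edge repair performed when a node disappears from the catalogue (the last steps of the deletion procedure) can be sketched in Python as follows. This is a simplified illustration under assumptions: the catalogue is represented by a hypothetical dictionary `succ` mapping each node to the list of its direct successors, and the bookkeeping of data elements is left out.

```python
def reachable(succ, start):
    """All nodes reachable from start by a directed path of at least one edge."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def remove_node(succ, node):
    """Remove node and reconnect its neighbours, keeping the graph reduced."""
    prev_nodes = [p for p in succ if node in succ[p]]
    next_nodes = succ.pop(node)
    for p in prev_nodes:
        succ[p].remove(node)
    for p in prev_nodes:
        reach = reachable(succ, p)          # computed after the node is gone
        for n in next_nodes:
            if n not in reach:              # add only edges that are not implied
                succ[p].append(n)


diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
remove_node(diamond, "b")                   # "d" stays reachable through "c"

chain = {"x": ["y"], "y": ["z"], "z": []}
remove_node(chain, "y")                     # "x" must now point to "z" directly
```

In the diamond no new edge is created because the successor is still reachable, whereas in the chain the predecessor has to be linked to the successor directly.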

Hash structures are also conservative. Trees are not, but it is true that no 
more than $O(\log n_i)$ re-structuring modifications are needed after the naive 
deletion. 

4.3 Additional Lookup Operations 

Apart from the update of data elements, which requires a deletion/insertion pair 
in general, all conventional operations have been discussed. Since all these oper- 
ations must be supported by all physical organisation structures, the usefulness 
of a structure is rather determined by the other facilities it provides for applica- 
tions. In accordance with the Introduction, the most important operations and 
their complexity are discussed next. 

Data elements which meet a minimum requirement are easily retrievable from 
any index but not from hashes. Analogously, a maximum requirement can also be 
specified. These methods are called lower and upper bound queries respectively 
and can be implemented for catalogues as follows. 

Algorithm 4 (Lower Bound Search in a Catalogue). 

lowerSearch($v_i$) - find all data elements containing at least $v_i$ in the $i^{th}$ field 

- For each entry point as node, call lowerSearchNode(node, $v_i$). 

lowerSearchNode(node, $v_i$) 

1. If $v_i \preceq_i$ node, call lowerSearchAll(node) and return. 

2. Call lowerSearchNode(nextNode, $v_i$) with all successors of node as 
nextNode for which 

nextNode $\preceq_i v_i$. 



lowerSearchAll(node) 

1. Emit the elements of node as result. 

2. Call lowerSearchAll(nextNode) with all successors of node as nextNode. 

Algorithm 5 (Upper Bound Search in the Catalogue). 

upperSearch($v_i$) - find all data elements containing at most $v_i$ in the $i^{th}$ field 

- For each entry point as node, call upperSearchNode(node, $v_i$). 

upperSearchNode(node, $v_i$) 

1. If node $\preceq_i v_i$ does not hold, return. 

2. Emit the elements of node as result. 

3. Call upperSearchNode(nextNode, $v_i$) with all successors of node as 
nextNode. 

Because interval queries inherently require the scan of a part of data elements, 
namely the ones to be returned, the cost of scanning usually dominates the cost 
of the operation both at indexes and catalogues. But the upper bound search 
in a catalogue has the advantage of delivering partial results while processing, 
rather than all hits at once at the end. Furthermore, it should be noted that catalogues 
are more flexible since they allow the use of pre orders, not only total orders as 
indexes do. 
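
A minimal Python sketch of the two bound queries follows. It is written under assumptions and is not the authors' code: the catalogue is given by hypothetical dictionaries `succ` (direct successors) and `items` (data elements per node), results are produced lazily by generators to mirror the delivery of partial results, and a `seen` set is added so that shared successors are visited only once. The lower-bound variant deliberately traverses the whole catalogue instead of pruning, which is a simplification.

```python
def upper_search(entry_points, succ, items, v, leq):
    """Yield the data elements of every node that precedes v (node <= v)."""
    seen, stack = set(), list(entry_points)
    while stack:
        node = stack.pop()
        # If node does not precede v, none of its successors can either.
        if node in seen or not leq(node, v):
            continue
        seen.add(node)
        yield from items[node]
        stack.extend(succ[node])


def lower_search(entry_points, succ, items, v, leq):
    """Yield the data elements of every node that v precedes (v <= node)."""
    seen, stack = set(), list(entry_points)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if leq(v, node):
            yield from items[node]
        stack.extend(succ[node])


# Facilities catalogue of the travel agency example (set inclusion as order).
E = frozenset()
S = frozenset({"sauna"})
GS = frozenset({"golf", "sea"})
GTSS = frozenset({"golf", "tennis", "sauna", "sea"})
CGTSS = frozenset({"casino", "golf", "tennis", "sauna", "sea"})
succ = {E: [S, GS], S: [GTSS], GS: [GTSS], GTSS: [CGTSS], CGTSS: []}
items = {E: ["SH1"], S: ["SH2"], GS: ["TM"], GTSS: ["SH"], CGTSS: ["GB"]}

at_most = sorted(upper_search([E], succ, items, frozenset({"golf", "sea", "sauna"}),
                              frozenset.issubset))
at_least = sorted(lower_search([E], succ, items, frozenset({"golf"}),
                               frozenset.issubset))
```

With set inclusion as the order, the upper-bound query asks for resorts offering nothing beyond the given facilities, and the lower-bound query for resorts offering at least the given ones.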

Value-based neighbouring relationship cannot be interpreted in hash struc- 
tures, and it is in general of no interest in indexes. Catalogues, by means of pre 
orders, enable the modelling of more complex, more realistic situations, where 
the relationship of being a neighbour is ambiguous. In accordance with com- 
mon requirements as demonstrated in Example 1, we propose posing similarity 
and not neighbouring queries to catalogues. The two differ only if the given 
value is contained in the catalogue: in that case similarity responds with the data 
elements containing the specified value. (Otherwise, the data elements assigned to the 
nodes which would be neighbours if the value were represented are returned.) 

Algorithm 6 (Similarity Search in a Catalogue). 

similarSearch($v_i$) - find all data elements containing a value similar to $v_i$ in the 
$i^{th}$ field 

- For each entry point as node, call similarSearchNode(node, null, $v_i$). 

similarSearchNode(node, prevNode, $v_i$) 

1. If node $\preceq_i v_i$ does not hold, jump to step 10. 

2. Select all nextNode successors of node into nextLess for which 
nextNode $\preceq_i v_i$. 

3. If nextLess = $\emptyset$, jump to step 6. 

4. For each nextNode $\in$ nextLess call similarSearchNode(nextNode, node, $v_i$). 

5. Return 
6. If 

Vi node A prevNode ^ null 



holds, emit the elements of prevNode as result. 

7. If 



Vi node 



is not true, emit the elements of node as result. 

8. Emit all elements of nextNode successors of node as result for which 
$v_i \preceq_i$ nextNode. 



9. Return 

10. If $v_i \preceq_i$ node, emit the elements of node as result. 

Like the classical search method, this implementation runs in $O(e_i + d_i + M_i)$ 
time, where $N_i$ is replaced by the quantity $M_i$. Parts of the result are delivered 
here before completing the operation, too. 



5 Conclusion 

The main goal of the paper has been to verify that the use of catalogues in 
the physical layer of a database is feasible and beneficial. Clearly, catalogues 
based on pre orders provide better support for modern database applications 
by enabling the database manager to answer lower-bound, upper-bound and 
similarity queries even where the neighbour relationship is ambiguous. 

We have analysed the memory footprint of the catalogue structure and the 
time complexity of conventional operations (data element lookup, insertion and 
deletion) in terms of block accesses. The results of complexity comparison with 
the commonly used physical organisation techniques show that either parallel 
operation of catalogues and classical methods is required, or obvious improve- 
ments that are applicable to the algorithms (such as parallelisation, traversing 
common subtrees only once, or processing subcatalogues earlier which have de- 
livered more hits so far) should be employed. 

In future work we will verify that comparison time has no significant 
effect on performance and consider the aforementioned modifications. Based 
on the currently existing prototype implementation, a test environment will be 
set up so that we can analyse what percentage of real-life problems provide 
appropriate ordering relations which are applicable for catalogue construction. 
Abstract. Web caching keeps single Web objects ready somewhere in 
caches in the user-to-server path, whereas database caching uses full- 
fledged database management systems as caches to adaptively maintain 
sets of records from a remote database and to evaluate queries on them. 
Using so-called cache groups, we introduce the new concept of constraint- 
based database caching. These cache groups are constructed from param- 
eterized cache constraints, and their use is based on the key concepts of 
value and domain completeness. We show how cache constraints affect 
the correctness of query evaluations in the cache and which optimiza- 
tions they allow. Cache groups supporting practical applications must 
exhibit controllable load behavior for which we identify necessary con- 
ditions. For such safe cache groups, the cost trade-off for record loading 
and predicate evaluation saving has to be observed during their design. 
Therefore, we analyze their load overhead and propose a population es- 
timation algorithm to be used for a cache group advisor. 



1 Introduction 

Transactional Web applications (TWAs) in various domains (often called e*- 
applications) dramatically grow in number and complexity. At the same time, 
each application faces increasing demands regarding data volumes and workloads 
to be processed efficiently. In such situations, caching is a proven concept to 
improve response time and scalability of the applications as well as to minimize 
communication delays in wide-area networks. For this reason, a broad spectrum 
of techniques has emerged in recent years to keep static Web objects (like HTML 
pages, XML fragments, or images) in caches in the user-to-server path (client- 
side caches, proxies of various types, CDNs). 

As the TWAs must deliver more and more dynamic and frequently updated 
content, this so-called Web caching [9,11] should be complemented by techniques 
that are aware of the consistency and completeness requirements of cached data 
(whose source is dynamically changed in backend databases) and that, at the 
same time, adaptively respond to changing workloads. Attempts targeting these 
objectives are called database caching, for which several different solutions have 
been proposed in recent years [2,3,4]. Currently many database vendors are de- 
veloping prototype systems or are just extending their current products [8,10]. 
[Figure: Web clients; application logic; frontend DB servers; backend DB server] 

Fig. 1. Database caching for Web applications 



What is the technical challenge of all these approaches? When user requests 
require responses to be assembled from static and dynamic contents somewhere 
in a Web cache, the dynamic portion is generated by a remote application server, 
which in turn asks the backend DB server for up-to-date information, thus caus- 
ing substantial latency. An obvious reaction to this performance problem is the 
migration of application servers to data centers closer to the users: Figure 1 il- 
lustrates that clients select one of the replicated Web servers “close” to them in 
order to minimize its response time. This optimization is amplified if the asso- 
ciated application servers can instantly provide the expected data - frequently 
indicated by geographical contexts. But the displacement of application servers 
to the edge of the Web alone is not sufficient; conversely it would dramatically 
degrade the efficiency of DB support because of the frequent round trips to the 
then remote backend DB server. As a consequence, primarily used data should be 
kept close to the application servers in so-called DB caches. A flexible solution 
should not only support database caching at mid-tier nodes of central enter- 
prise infrastructures [10], but also at edge servers of content delivery networks 
or remote data centers. 

Another important aspect of a practical solution is to achieve full cache trans- 
parency for the applications, i.e., modifications of the application programming 
interface are not tolerated. Such a property gives the cache manager the choice 
at run time to process a query locally or to send it to the backend DB server, 
e.g., to comply with strict consistency requirements. Cache transparency typi- 
cally requires that each DB object is represented only once in a cache and that 
it exhibits the same properties (name, type, etc.) as in the backend. 

The use of SQL implies another challenge because of its declarative and 
set-oriented nature. This means that, to be useful, the cache manager has to 
guarantee that queries can be processed in the DB cache, i.e., the sets of records 
(of various types) satisfying the corresponding predicates - denoted as predicate 
extensions - must be completely in the cache. This completeness condition, the 
so-called predicate completeness, ensures that the query evaluation semantics is 
equivalent to the one provided by the backend. 

A federated query facility [2,8] allows cooperative predicate evaluation by 
multiple DB servers. This property is very important for cache use, because 
local evaluation of some (partial) predicate can be complemented by the work of 
the backend DB server on other (partial) predicates whose extensions are not in 
the cache. Hence, in the following we refer to predicates meaning their portions 
to be evaluated in the cache. 



2 Constraint-Based Database Caching 

We take a look at the concepts developed and realized in the DBCache project 
[2] and explore the underlying ideas. This work has led us to a class of tech- 
niques which we term constraint-based database caching [7]. In particular, we an- 
alyze techniques which support the evaluation of specific PSJ queries (projection- 
selection-join queries) in the cache. 

For the specification of cache contents, we refer to a particular approach 
called cache groups. In short, a cache group is a collection of related cache ta- 
bles. Cache constraints defined on and between them determine the records of 
the corresponding backend tables that have to be kept in the cache. The tech- 
nique does not rely on the specification of static predicates: The constraints are 
parameterized, which makes this specification adaptive; it is completed when the 
parameters are instantiated by values of so-called cache keys. An “instantiated 
constraint” then corresponds to a predicate and, when the constraint is satisfied 
- i.e., all related records have been loaded - the predicate extension delivers 
correct answers to eligible queries. 

The key idea of constraint-based caching is to start with very simple base 
predicates (here equality predicates) and to extend them by other types of pred- 
icates (equi-join predicates in our case) in a constructive way, such that cache 
maintenance can always guarantee the presence of the corresponding predicate 
extensions in the cache. Hence, there are no or only simple decidability prob- 
lems whether a complete predicate evaluation can be performed. Only a simple 
probe query is required in the cache at run time to determine the availability 
of eligible predicate extensions. Furthermore, because all columns of the corre- 
sponding backend tables are kept in the cache, all project operations possible in 
the backend can also be performed in the cache thereby enabling PSJ queries. 
Since full DB functionality is available, the results of these queries can further 
be refined by operations like group-by, having, or order-by. 



2.1 How Do Cache Groups Work? 

As introduced above, a cache group is a collection of related cache tables. For 
simplicity, the names of tables and columns are identical in the cache and in the 
backend DB. Considering a cache table S, we denote by $S_B$ its corresponding 
backend table, by S.c a column c of S. Note, a cache usually contains only 
subsets of records pertaining to a small fraction of backend tables. Its primary 
task is to support query processing for TWAs which typically contain up to 3 
or 4 joins [2]. Hence, we expect the number of cache tables - featuring a high 
degree of reference locality - to be on the order of 10 or less, even if the backend 
DB consists of hundreds of tables. 

If we want to be able to evaluate a given predicate in the cache, we must keep 
a collection of records in the cache tables such that the completeness condition 
for the predicate is satisfied. For simple equality predicates like S.c = v this 
completeness condition takes the shape of value completeness. 

Definition 1 (Value completeness, VC). A value v is said to be value complete 
in a column S.c if and only if all records of $\sigma_{c=v} S_B$ are in S. 

If we know that a value v is value complete in a column S.c, we can correctly 
evaluate S.c = v, because all rows from the corresponding backend table that 
carry that value are in the cache. But how do we know that v is value complete? 
This is easy if we maintain domain completeness of specific table columns. 

Definition 2 (Domain completeness, DC). A column S.c is said to be do- 
main complete (DC) if and only if all values v in S.c are value complete. 

Given a domain-complete column S.c, if a probe query confirms that value v 
is in S.c (a single record suffices), we can be sure that v is value complete and 
thus evaluate S.c = v in the cache. Note that unique (U) columns of a cache 
table (defined by SQL constraints “unique” or “primary key” in the backend 
DB schema) are DC per se (implicit domain completeness). Non-unique (NU) 
columns in contrast need extra enforcement of DC. 
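
Definitions 1 and 2 and the probe step can be made concrete with a small sketch. It assumes, purely for illustration, that a table is a list of row dictionaries; the function names, the column names `id` and `c`, and the sample rows are hypothetical and not taken from the DBCache prototype. The two completeness functions consult the backend only to state the definitions; at run time the cache relies on the probe alone.

```python
# Illustrative model: a table is a list of row dicts; names and data are made up.

def select_eq(table, column, value):
    """Rows of `table` with row[column] == value, i.e. sigma_{column=value}(table)."""
    return [row for row in table if row[column] == value]

def is_value_complete(cache, backend, column, value):
    """Definition 1: every backend row carrying `value` is present in the cache."""
    return all(row in cache for row in select_eq(backend, column, value))

def is_domain_complete(cache, backend, column):
    """Definition 2: every value occurring in the cached column is value complete."""
    cached_values = {row[column] for row in cache}
    return all(is_value_complete(cache, backend, column, v) for v in cached_values)

def probe(cache, column, value):
    """Run-time probe on the cache: a single matching row suffices."""
    return any(row[column] == value for row in cache)

backend_S = [
    {"id": 1, "c": "gold"},
    {"id": 2, "c": "gold"},
    {"id": 3, "c": "silver"},
]
cache_full_gold = backend_S[:2]   # both 'gold' rows are cached
cache_partial = backend_S[:1]     # only one of the two 'gold' rows is cached
```

In this toy data the unique column `id` is domain complete even in `cache_partial`, which mirrors the implicit domain completeness of U columns, whereas the non-unique column `c` is not.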

So far, we can evaluate only equality predicates, i. e., simplest selection 
queries, in the cache. To enhance such queries with equi-join predicates, we 
introduce referential cache constraints (RCCs), which guarantee the correctness 
of equi-joins between cache tables. Such RCCs are specified between two cache 
table columns: a source column S.a and a target column T.b. The tables S and T 
need not be different, not even the columns themselves. 

Definition 3 (Referential cache constraint, RCC). An RCC S.a → T.b between 
columns S.a and T.b is satisfied if and only if all values v in S.a are value 
complete in T.b. 

An RCC S.a → T.b ensures that, whenever we find a record s in S, all join 
partners of s with respect to S.a = T.b are in T. Note, the RCC alone does not 
allow us to correctly perform this join in the cache: Many rows of $S_B$ that have 
join partners in $T_B$ may be missing from S. But using an equality predicate on 
a DC column S.c as an “anchor”, we can restrict this join to records that exist 
in the cache: The RCC S.a → T.b expands the predicate extension of S.c = x to the 
predicate extension of (S.c = x and S.a = T.b). In this way, DC columns serve 
as entry points for queries. 
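
A minimal sketch of Definition 3 and of the anchored join, again assuming tables modelled as lists of row dictionaries. The helper names `rcc_satisfied` and `anchored_join` and the sample rows are hypothetical.

```python
def values(table, column):
    """Set of values occurring in a column of a (cached) table."""
    return {row[column] for row in table}

def rcc_satisfied(cache_S, a, cache_T, backend_T, b):
    """Definition 3: every value in S.a is value complete in T.b."""
    for v in values(cache_S, a):
        partners = [row for row in backend_T if row[b] == v]
        if any(row not in cache_T for row in partners):
            return False
    return True

def anchored_join(cache_S, c, x, a, cache_T, b):
    """Evaluate (S.c = x and S.a = T.b) using cached rows only."""
    return [(s, t)
            for s in cache_S if s[c] == x
            for t in cache_T if s[a] == t[b]]

backend_O = [{"id": 10, "b": 1}, {"id": 11, "b": 1}, {"id": 12, "b": 2}]
cache_C = [{"a": 1, "t": "gold"}]
cache_O_ok = backend_O[:2]    # all join partners of C.a = 1 are cached
cache_O_bad = backend_O[:1]   # one join partner is missing
```

The join result is only trustworthy when the anchor column S.c is domain complete and the RCC holds; the sketch checks the second condition but takes the first for granted.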

Domain completeness of a column S.c is equivalent to an RCC S.c → S.c, a 
so-called self-RCC on its defining column S.c. By specifying such a self-RCC, 
the DBA can enforce domain completeness of S.c and thus create an entry point 
for query evaluation explicitly. 

How do the records that constitute a predicate extension get into the cache? 
And how are these predicate extensions actually chosen? For these tasks, we 
introduce the second kind of cache constraint, the so-called cache key. 

Definition 4 (Cache key). A cache key column S.k is always kept domain 
complete. Only values in $S_B.k$ initiate cache loading when they are referenced 
by user queries. 

You can imagine that the specification of a cache key includes a self-RCC; 
similar to it, a cache key can always be used as an entry point. (In both cases 
the columns get explicitly domain complete.) But in addition, a cache key serves 
as a filling point for a distinguished root table R (the only table in a cache group 
that contains cache keys) and - via the (paths of) RCCs specified between R 
and related cache tables - for the member tables of the cache group. Whenever 
a query references a particular cache key value that is not in the cache, the 
backend DB must evaluate this query. But as a consequence of this cache miss 
attributed to a cache key, the cache manager satisfies the value completeness for 
the missing cache key value by fetching all required records from the backend and 
loading them into the cache table R (thus keeping the cache key column domain 
complete). To satisfy the RCCs, the member tables of the cache group are loaded 
in a similar way (details are provided in [2]). Hence, a reference to a cache key 
value x serves as something like an indicator that, in the immediate future, 
locality of reference is expected on the predicate extension determined by x. 
Cache keys therefore carry information about the future workload and sensitively 
influence caching performance. Hence, DBAs must select them carefully^. 
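
The filling behavior triggered by a cache key reference can be sketched as a simple worklist procedure. This is a naive illustration under stated assumptions, not the loading algorithm of [2]: tables are dictionaries of row lists, an RCC is a tuple (source table, source column, target table, target column), there is a single cache key, and all names and rows are hypothetical.

```python
def load_value(cache, backend, table, column, value):
    """Make `value` value complete in table.column; return the newly loaded rows."""
    new_rows = [row for row in backend[table]
                if row[column] == value and row not in cache[table]]
    cache[table].extend(new_rows)
    return new_rows

def reference_cache_key(cache, backend, rccs, root, key_column, value):
    """Return 'hit' if the probe succeeds; otherwise fill the cache group and return 'miss'."""
    if any(row[key_column] == value for row in cache[root]):
        return "hit"
    # Cache miss: the backend answers the query, and the cache is filled afterwards.
    pending = [(root, row) for row in load_value(cache, backend, root, key_column, value)]
    while pending:
        table, row = pending.pop()
        for src_t, src_c, dst_t, dst_c in rccs:
            if src_t == table:
                # Satisfy the RCC: the value in the source column must be value complete.
                for new_row in load_value(cache, backend, dst_t, dst_c, row[src_c]):
                    pending.append((dst_t, new_row))
    return "miss"

backend = {
    "C": [{"a": 1, "t": "gold"}, {"a": 2, "t": "gold"}, {"a": 3, "t": "silver"}],
    "O": [{"id": 10, "b": 1, "c": 100}, {"id": 11, "b": 1, "c": 101},
          {"id": 12, "b": 3, "c": 100}],
    "P": [{"d": 100}, {"d": 101}, {"d": 102}],
}
rccs = [("C", "a", "O", "b"), ("O", "c", "P", "d")]
cache = {"C": [], "O": [], "P": []}
first = reference_cache_key(cache, backend, rccs, "C", "t", "gold")
second = reference_cache_key(cache, backend, rccs, "C", "t", "gold")
```

The loop terminates because each backend row is loaded at most once. A value that does not occur in the backend column loads nothing, in line with Definition 4.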

2.2 Types of RCCs and Their Use in Cache Groups 

Depending on the types of the source and target columns, we classify RCCs as 
(1:1), (1:n), (n:1), and (n:m) and denote them as follows: 

- U → U or U → NU: member constraint (MC) 

- NU → U: owner constraint (OC) 

- NU → NU: cross constraint (XC). 

Using RCCs we implicitly introduce something like a value-based table model 
intended to support queries. Despite similarities, MCs and OCs are not identi- 
cal to the PK/FK (primary key/foreign key) relationships in the backend DB: 
Those can be used for join processing symmetrically, RCCs only in the specified 
direction. XCs have no counterparts at all. Because a high fraction of all SQL 
join queries refers exclusively to PK/FK relationships - they represent real-world 
relationships captured by the DB design -, almost all RCCs are expected to be 
of type MC or OC; accordingly XCs and multiple RCCs ending on a NU column 
seem to be rare. 

^ Low-selectivity cache key columns may cause cache filling actions involving huge sets 
of records never used later. It may therefore be necessary to control the use of cache 
key values with stop-word or recommendation lists. 

Fig. 2. Cache group OP for order processing 



Assume a cache group OP with tables C, O, and P and cache key C.t, formed 
by C.a → O.b and O.c → P.d, where C.a and P.d are U columns and C.t, 
O.b and O.c are NU columns (Fig. 2). In a common real-world situation, C, O, 
and P could correspond to backend DB tables Customer, Order and Product. 
Hence, both RCCs would typically characterize PK/FK relationships that will be 
used for join processing in the cache. Additional RCCs, for example, C.t → O.b 
or O.c → C.n, are conceivable; such RCCs, however, have no counterparts in 
the backend DB schema and, when used for a cross join of C and O, their 
contributions to the query semantics remain the user’s responsibility. 

As we know, if a probing operation on some domain-complete column T.c 
identifies value x, we can use T.c as an entry point for evaluating T.c = x. Now, 
any enhancement of this predicate with equi-join predicates is allowed if these 
predicates correspond to RCCs reachable from cache table T. 

Assume we find ‘gold’ in C.t (of cache group OP); then the predicate (C.t = 
‘gold’ and C.a = O.b and O.c = P.d) can be processed in the cache correctly. 
Because the predicate extension (with all columns of all cache tables) is completely 
accessible, any column may be specified for output. Of course, a correct predicate 
can be refined by “and-ing” additional selection terms (referring to cache table 
columns) to it; e. g. (C.t = ‘gold’ and O.c > 42 and C.n like ‘Smi%’ and . . . ). 
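
The probe-then-evaluate pattern for cache group OP can be sketched with an in-memory SQLite database standing in for the cache. The table contents, the extra columns `id` and `name`, and the helper names are assumptions made for this illustration; the sketch presumes that the cache manager has already established the constraints of OP for the cached rows.

```python
import sqlite3

# In-memory stand-in for the cache; schema details and rows are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE C (a INTEGER PRIMARY KEY, t TEXT, n TEXT);
CREATE TABLE O (id INTEGER PRIMARY KEY, b INTEGER, c INTEGER);
CREATE TABLE P (d INTEGER PRIMARY KEY, name TEXT);
INSERT INTO C VALUES (1, 'gold', 'Smith'), (2, 'gold', 'Jones');
INSERT INTO O VALUES (10, 1, 100), (11, 1, 101);
INSERT INTO P VALUES (100, 'bolt'), (101, 'nut');
""")

def probe(con, table, column, value):
    """Probe query: does at least one cached row carry `value` in table.column?"""
    # Identifiers are interpolated because this sketch uses fixed, trusted names.
    sql = f"SELECT 1 FROM {table} WHERE {column} = ? LIMIT 1"
    return con.execute(sql, (value,)).fetchone() is not None

def orders_of(con, customer_type):
    """Answer the PSJ query in the cache, or return None to signal backend evaluation."""
    if not probe(con, "C", "t", customer_type):
        return None
    return con.execute(
        """SELECT C.n, O.id, P.name
           FROM C JOIN O ON C.a = O.b JOIN P ON O.c = P.d
           WHERE C.t = ? ORDER BY O.id""",
        (customer_type,),
    ).fetchall()
```

A `None` result stands for the case in which the query must be sent to the backend; further selection terms, grouping, or ordering could be added to the SQL text once the probe has succeeded.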

3 Cache Group Design and Analysis 

At this point, we know how to configure a cache group by specifying the partici- 
pating tables, the RCCs connecting them, and the cache keys, which initiate the 
population of the cache group. We can use domain-complete columns as entry 
points to obtain correct query results for eligible query predicates. Is this all we 
need to know to design and to effectively make use of cache groups? 

On the one hand, a cache group should enable as flexible use for predicate 
evaluation as possible: We should not leave any entry point or RCC unexploited. 
This requires that we know about all of them, not just about those we specified 
explicitly. On the other hand, we want to design safe cache groups which exhibit 
controllable load behavior. 

Definition 5 (Reachability graph). A reachability graph is a directed graph 
implicitly defined for a cache group G. It has G’s tables and RCCs as nodes and 
edges. The graph is composed by starting from G’s root table and following all RCCs 
transitively, thereby connecting all reachable tables (as nodes of the graph). 

Definition 6 (Paths and cycles). A path in a reachability graph starts at a 
source table and ends at a sink table. It connects a collection of cache tables via 
a sequence of RCCs. No RCC may appear twice in a path. A cycle is a path that 
starts and ends at the same table. 



Definition 7 (Homogeneous and heterogeneous cycles). A cycle is homo- 
geneous, if only a single column per table is involved, heterogeneous otherwise. 

Heterogeneous RCC cycles can lead to excessive population of cache groups 
primarily caused by recursive filling actions. Such “dangerous” load behavior 
must clearly be identified and prevented. 
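
Definitions 5 to 7 translate directly into a few lines of code. The sketch below assumes an RCC is written as a tuple (source table, source column, target table, target column) and that a cycle is handed over as the list of its RCCs; the function names are hypothetical.

```python
from collections import deque

def reachable_tables(root, rccs):
    """Nodes of the reachability graph: tables reached from the root via RCCs."""
    seen, queue = {root}, deque([root])
    while queue:
        table = queue.popleft()
        for src_t, _, dst_t, _ in rccs:
            if src_t == table and dst_t not in seen:
                seen.add(dst_t)
                queue.append(dst_t)
    return seen

def classify_cycle(cycle):
    """Definition 7: homogeneous iff only a single column per table is involved."""
    columns = {}
    for src_t, src_c, dst_t, dst_c in cycle:
        columns.setdefault(src_t, set()).add(src_c)
        columns.setdefault(dst_t, set()).add(dst_c)
    single = all(len(cols) == 1 for cols in columns.values())
    return "homogeneous" if single else "heterogeneous"

rccs_OP = [("C", "a", "O", "b"), ("O", "c", "P", "d")]
homogeneous_cycle = [("S", "a", "T", "b"), ("T", "b", "S", "a")]
heterogeneous_cycle = [("S", "a", "T", "b"), ("T", "c", "S", "a")]
```

In the second cycle, table T participates with two columns (b and c), which is exactly the pattern that can trigger recursive filling.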

3.1 Entry Points for Query Evaluation 

So far, we have argued that a cache table column can be tested and used by an 
equality predicate correctly only if it is domain complete. But how do we know 
that? Of course, cache table columns that carry either a self-RCC or a cache key 
(i. e., at least all filling points) are explicitly domain complete; columns of type U 
are implicitly domain complete. Cache-supported query evaluation gains much 
more flexibility and power, if we can correctly decide that other cache columns 
are domain complete as well. 

Let us refer again to OP. Because C.a → O.b is the only RCC that 
induces filling of O, we know that O.b is domain complete (denoted as induced 
domain completeness). Hence, we can correctly evaluate the query predicate 
(O.b = y and O.c = P.d) if we encounter value y in O.b - in addition to 
(C.a = x and C.a = O.b and O.c = P.d) if value x is in C.a. 

Note, additional RCCs ending in O.b would not destroy the DC of O.b, though 
any additional RCC ending in a column different from O.b would do^: Assume an 
additional RCC ending in O.e induces a new value v, which implies the insertion 
of $\sigma_{e=v} O_B$ into O - just a single record o. Now a new value w of o.b, so far 
not present in O.b, may appear, but all other records of $\sigma_{b=w} O_B$ fail to do so. 
For this reason, a cache table filled by RCCs (or cache keys) on more than one 
column cannot have an induced DC column. This means that induced DC is 
context dependent; in contrast to explicit or implicit DC (i.e., DC of cache key, 
self-RCC, or U columns), it can be lost when a cache group configuration is 
modified. This leads us to the following definition: 

Definition 8 (Induced domain completeness). A cache table column is in- 
duced domain complete, if it is the only column of a cache table filled^ via one 
or more RCCs or via a cache key definition. 
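
The three sources of domain completeness (explicit, implicit, induced) can be told apart mechanically. The following sketch assumes columns are written as (table, column) pairs and RCCs as (source, target) pairs of such columns. It treats every non-self RCC as one that fills its target, which is a simplification, because distinguishing filling RCCs from merely reaching ones is not settled by the text.

```python
def domain_completeness(col, unique, cache_keys, rccs):
    """Return 'explicit', 'implicit', 'induced', or None for a (table, column) pair."""
    table = col[0]
    if col in cache_keys or (col, col) in rccs:
        return "explicit"                      # cache key or self-RCC
    if col in unique:
        return "implicit"                      # U column
    # Columns of this table that are filled via an RCC or a cache key (Definition 8).
    filled = {dst for src, dst in rccs if dst[0] == table and src != dst}
    filled |= {key for key in cache_keys if key[0] == table}
    return "induced" if filled == {col} else None

# Cache group OP; the extra column O.e below is hypothetical.
unique = {("C", "a"), ("P", "d")}
cache_keys = {("C", "t")}
rccs = {(("C", "a"), ("O", "b")), (("O", "c"), ("P", "d"))}
rccs_extended = rccs | {(("C", "t"), ("O", "e"))}
```

With the additional RCC ending in O.e, the column O.b loses its induced domain completeness, which illustrates the context dependence discussed above.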

Let us summarize our findings concerning the population of cache tables and 
the domain completeness of their columns: A cache table T can be filled via 
cache key columns or RCCs ending in one or more of its columns. A column 
of T can be domain complete due to specifications in the backend (implicitly: 
U columns), due to specifications in the cache (explicitly: cache keys, self-RCCs), 
or as a result of the interaction of specified items (induced). 

^ We must distinguish between RCCs that only reach a column and RCCs that fill it: 
RCCs that never cause any record to be loaded (e. g., a self-RCC on a U column) do 
not disturb induced DC. How to effectively classify an arbitrary RCC is unsettled. 

Analogous to extra DC columns, one can discover optimization RCCs in a 
cache group, i. e., RCCs that have not been specified, but hold in the given 
context. For example, in OP the (optimization) RCC O.b → C.a allows an 
additional join direction. 

3.2 Safeness of Cache Groups 

It is unreasonable to accept all conceivable cache group configurations, because 
cache misses on cache key columns may provoke unforeseeable load operations. 
Although the cache can be populated asynchronously to the transaction ob- 
serving the cache miss, avoiding a burden on its response time, uncontrolled 
loading is undesirable: Substantial extra work, which can hardly be estimated, 
may be required by the frontend and backend DB servers, which will influence 
the transaction throughput in heavy workload situations. 

Specific cache group configurations may even exhibit a recursive loading be- 
havior that jeopardizes their caching performance. Once cache filling is initiated, 
the enforcement of cache constraints may require multiple phases of record 
loading. Such behavior typically occurs when two NU-DC columns a and b of a 
cache table X must be maintained. A set of values appears in a, for which X is 
loaded with the corresponding records of $X_B$ to keep column a domain complete. 
These records populate b with a set of (new) values that have to be made value 
complete, which possibly introduces new values into a again. As a result, a and 
b may receive new values in a recursive way. 

Cache groups are called safe if no recursive load behavior is possible. Upon 
a cache key miss, we want the initiated cache loading to stop after a single pass 
of filling operations through the tables of the cache group. 
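
The recursive load behavior for two NU-DC columns of one table can be reproduced with a small simulation. It is a sketch under simplifying assumptions: a single table, rows modelled as dictionaries, and hypothetical data; it counts the phases in which at least one record is loaded.

```python
def fill(backend, dc_columns, column, value):
    """Load rows until all listed columns are domain complete; return (cache, phases)."""
    cache, phases = [], 0
    pending = [(column, value)]
    while pending:
        next_pending, loaded_any = [], False
        for col, val in pending:
            for row in backend:
                if row[col] == val and row not in cache:
                    cache.append(row)
                    loaded_any = True
                    # New values in the other DC columns must be made value complete too.
                    next_pending += [(other, row[other])
                                     for other in dc_columns if other != col]
        if loaded_any:
            phases += 1
        pending = next_pending
    return cache, phases

X_B = [
    {"a": 1, "b": 10}, {"a": 1, "b": 11},
    {"a": 2, "b": 10}, {"a": 2, "b": 12},
    {"a": 3, "b": 12}, {"a": 4, "b": 99},
]
unsafe_cache, unsafe_phases = fill(X_B, ["a", "b"], "a", 1)
safe_cache, safe_phases = fill(X_B, ["a"], "a", 1)
```

With both columns kept domain complete, the single value a = 1 pulls in five of the six backend rows over four loading phases; with only column a maintained, loading stops after one pass with two rows.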

Obviously, recursive loading requires that there is a cyclic structure among 
the specified RCCs (remember, every cache key also contains an RCC). Simple 
examples show that there are not only unsafe RCC cycles, but also safe ones 
(consider a homogeneous cycle) [2,7]. We analyzed cycles in detail and derived 
safeness conditions for cache group configurations. These conditions are more 
sophisticated than a simple exclusion of pairs of NU-DC columns (as sketched 
above), because the mutual introduction of new values can span several tables 
and can also be neutralized by compensating effects. Nevertheless the safeness 
conditions can be stated as a single rule that requires the designer of a cache 
group to inspect all contained cycles for certain patterns of U and NU columns. 



4 Loading Behavior of Cache Groups 

So far, we have discussed the correctness and safeness conditions of cache groups. 
To analyze the loading behavior, we will derive a simple quantitative cost model. 
It is aimed at estimating the approximate cost (depending on column selectivities 
and RCCs specified) for the population of a cache group caused by a single cache 
key value. 



4.1 Model Assumptions 

For quantitative modeling, we generally assume uniform value distribution in 
columns and stochastic independence between columns (i. e., the standard 
assumptions of query processing). Each column of a cache table T inherits the 
cardinality of the corresponding column of backend table $T_B$. Hence, T.j has 
cardinality $c_{T.j}$ (i.e., it has up to $c_{T.j}$ distinct values). We define the 
selectivity of column T.j to be $s_{T.j} = 1/c_{T.j}$. Thus, the smaller the value $s_{T.j}$, i.e., 
the larger the value $c_{T.j}$, the higher is the column selectivity. For example, if T 
contains $N_T$ records, an equality predicate on T.j qualifies $N_T \cdot s_{T.j} = N_T/c_{T.j}$ 
records ($1 \le c_{T.j} \le N_T$), implying that NULL values are excluded. 

When $n_T$ records are filled into a table T, e. g., to satisfy a cache constraint on 
a given column T.i, how many distinct values d are entered into a stochastically 
independent column T.j? The result for the boundary values is obvious: If T.j is 
unique, $d = n_T$, and if the cardinality of T.j is 1, d = 1. In general, an accurate 
determination of d calls for a stochastic model which evaluates the expected 
number of distinct values of T.j [12]. In abstract terms: Given natural numbers 
N, c, n ($1 \le c \le N$, $n \le N$), what is the expected number d of colors when n balls 
are drawn without replacement from a bucket with N balls? These balls occur 
in c different colors and are uniformly distributed, i.e., there are N/c balls per 
color. The following model, which we have derived for the sketched situation, is 
used throughout the paper and referenced by f(N, c, n): 
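
As a hedged illustration of such a model: for the urn problem described above, a standard hypergeometric argument gives the probability that a fixed color is missed by all n draws as $\binom{N - N/c}{n} / \binom{N}{n}$, so that the expected number of colors is $c \cdot \left(1 - \binom{N - N/c}{n} / \binom{N}{n}\right)$. This closed form is an assumption of the sketch and need not coincide with the expression the authors use; the function name `expected_distinct` is hypothetical, and N/c is assumed to be an integer.

```python
from math import comb

def expected_distinct(N, c, n):
    """Expected number of colors among n balls drawn without replacement
    from N balls that come in c colors with N/c balls per color."""
    if not (1 <= c <= N and 0 <= n <= N):
        raise ValueError("require 1 <= c <= N and 0 <= n <= N")
    if N % c != 0:
        raise ValueError("this sketch assumes that N/c is an integer")
    per_color = N // c
    # Probability that one fixed color does not occur among the n drawn balls.
    missed = comb(N - per_color, n) / comb(N, n)
    return c * (1 - missed)
```

The boundary cases named in the text come out as expected: a unique column (c = N) yields d = n, and a column of cardinality 1 yields d = 1 for any n ≥ 1.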




In frequent situations, more than one record set is independently filled into 
table T. Instead of computing the sum of the various set sizes, we could 
improve our population estimation by modeling such situations as a combination 
of events. Then the expected size $n_T$ of T’s population induced by m 
independent cache constraints could be calculated with the following considerations. If 
$A_1, \ldots, A_m$ are m events, what is the probability of the event that at least one 
among the m $A_i$ occurs? In symbols, this event is $A_{1 \ldots m} = A_1 \cup A_2 \cup \cdots \cup A_m$. It is 
not sufficient to know the probabilities of the individual events $A_i$, but we must 
have complete information concerning all possible overlaps. Fortunately, due to 
the stochastic-independence assumption, we can easily compute for each pair 
(i, j), each triple (i, j, l), etc., the probability of events $A_i \cap A_j$ or $A_i \cap A_j \cap A_l$, 
etc. Furthermore, we can compose our formula iteratively, thereby computing the 
probabilities of the following events [5]: $(A_1 \cup A_2)$, $(A_{12} \cup A_3)$, $(A_{123} \cup A_4)$, ... 

This abstract model for the combination of events can easily be applied to 
our problem of determining the number of distinct records when independent 
record sets are filled into a table. By multiplying the (filling) probabilities with 
table cardinality $N_T$, we immediately obtain $n_{T1 \ldots m}$ as the number of distinct 
records for m overlapping record sets. If $n_{T1}$ and $n_{T2}$ records are to be filled 
independently into table T, $n_{T12}$ is the number of records actually loaded, etc.: 

$$n_{T12} = n_{T1} + n_{T2} - n_{T1} \cdot n_{T2} / N_T ,$$
$$n_{T123} = n_{T12} + n_{T3} - n_{T12} \cdot n_{T3} / N_T ,$$
$$n_{T1234} = n_{T123} + n_{T4} - n_{T123} \cdot n_{T4} / N_T , \ldots$$
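
The iterative composition above is a simple fold over the individual set sizes. The sketch below uses the hypothetical name `combine`; the numbers in the usage note are made up.

```python
def combine(sizes, N_T):
    """Expected number of distinct records when independent record sets of the
    given sizes are filled into a table with N_T backend records."""
    total = 0.0
    for n in sizes:
        # n_{T1..k} = n_{T1..k-1} + n_{Tk} - n_{T1..k-1} * n_{Tk} / N_T
        total = total + n - total * n / N_T
    return total
```

For example, with $N_T = 1000$, the sets of sizes 100 and 200 combine to 100 + 200 - 20 = 280 records, and adding a third set of size 300 gives 280 + 300 - 84 = 496. The result equals $N_T \cdot (1 - \prod_i (1 - n_{Ti}/N_T))$, so it does not depend on the order of the sets.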

The rationale of our somewhat simplified estimation model, the calculation 
of average record populations in cache tables, is considered to be a great help for 
cache group design, e. g., when applied by a cache group advisor. A model refine- 
ment is only possible at the expense of substantially increased model complexity. 
The actual value distributions in columns and the size of record sets induced by 
RCCs (equivalently, (intermediate) join results) could be approached more accu- 
rately by introducing histograms, describing the frequency of individual values 
or of values belonging to value ranges, and join selectivities.^ This would require 
additional and more accurate statistical data for cache tables and cache groups 
which is, due to its dynamic nature, hard to derive and to maintain. 

4.2 Effective Cache Keys and Applicable RCCs 

To keep the population model simple, but at the same time as accurate as 
possible, we need the “right” concepts. As argued in the following, two essential 
concepts for the population estimation of a cache group G are applicable RCC 
and effective cache key. 

Any filling action in a cache group is path-dependent and depends on the type 
of RCC traversed. For example, an optimization RCC does not change G’s filling 
and need therefore not be considered for G’s population estimation. (In Fig. 3, 
Q.e → O.a and V.g → O.a would be such optimization RCCs.) Otherwise, we 
would have to deal with MC → OC cycles (the reverse owner constraint for an 
already traversed member constraint) adding unnecessary complexity. In Fig. 3, 
all four RCCs shown are applicable for the population estimation. 

Conversely, if a reverse member constraint is specified explicitly, this RCC is 
considered a design error. The resulting homogeneous cycle would only lead to 
excessive load situations without benefiting the transaction’s queries. 

When more than one cache key is specified for a root table, we can always 
reduce such a set of cache keys to a single effective cache key ($ck_{eff}$) as far as cost 
estimations for the filling process are concerned.^ $ck_{eff}$ is the only non-unique 
cache key, if any; otherwise, it is any unique cache key. When O.a and O.k are 
defined as cache keys in Fig. 3, then O.k is the effective one. 

^ However, this would only improve certain situations. Since values in a column, 
distributed according to a histogram, are used in RCCs which enforce the filling of 
these values in columns of other tables, the model complexity seems to be very hard 
to control. 

^ Consider two cache keys, T.u unique and T.n non-unique. If a value of u causes a 
cache miss, the single qualified record is loaded into the cache. The new value of 
T.n has to be made value complete which determines the set of records to be loaded 
(except in the case of a NULL value in T.n, which we exclude from our estimations). 

Fig. 3. Cache group G 

The column $ck_{eff}$ determines the number $n_S$ of records to be loaded into 
the root table S upon any cache miss, which is caused by probing an equality 
predicate on a cache key column of S. Furthermore, it guarantees that domain 
completeness for all cache key columns of S is satisfied after loading these $n_S$ 
records. In fact, for the computation of $n_S$, only the cardinality of the $ck_{eff}$ 
column matters (together with $N_S$, the cardinality of $S_B$). Therefore, we always 
deal with $ck_{eff}$ in the following. 

A computation step considers a source table S - for the initial step, the root 
table -, an applicable RCC R, and a target table T. Given $N_S$ and $N_T$, the 
expected filling size $n_T$ of T enforced by R can be derived from $n_S$ and from the 
cardinalities of the columns connected by R. In a subsequent step, table T may 
become the new source table S', and an outgoing RCC together with its target 
table T' is selected as the component under consideration. 

4.3 Population Estimation — A Simple Example 

To derive a general scheme of cache group filling, the basic step and its quan- 
titative description can be studied using the situation illustrated in Fig. 3. It 
covers the essential column and RCC types: O.k labels a NU (effective) cache 
key, whereas O.a and O.b denote U and NU (non-domain-complete) columns. 

In the following, a new value v of cache key O.k is assumed to be filled into 
cache table O, enforcing the load of $n_{O.k}$ records of type O to guarantee DC of 
O.k. Because O.a is unique, the number of new O.a values is $n_{O.k}$. But what is 
the number $n_{O.b}$ of O.b values that appear in the cache as a consequence of v? 
Furthermore, how many records of types Q, V, and P have to be loaded to 
satisfy the RCCs O.a → Q.e, O.a → V.g, and O.b → P.c? 

Of course, when a column is unique, its cardinality is equal to the 
number of records in the corresponding backend table, e. g., $c_{O.a} = N_O$. 
Using the uniform-value-distribution assumption, we can immediately compute 
$n_O = n_{O.k} = N_O/c_{O.k}$, which is the number of records of type O to be loaded. 
The computation of $n_{O.b}$ is much more difficult and requires additional thoughts 
which led to our model f(N, c, n). Hence, by substituting $n_{O.b}$ for d and $N_O$, 
$c_{O.b}$, and $n_O$ in f(N, c, n), we can compute $n_{O.b} = f(N_O, c_{O.b}, n_O)$.^ 

^ The formula for $f(N_O, c_{O.b}, n_O)$ is only one possible option. If available, any other 
(possibly more efficiently computable) approximation could act as a substitute. 

A fundamental difficulty prohibits the more accurate approximation of the 
case where both O.k and O.b are non-unique. All the $n_{O.b}$ values computed by 
our formula become effective for the first O.k value only. In a subsequent filling 
initiated by a new O.k value v', some O.b values qualified by v' may already reside 
in the root table, and, therefore, only $n_{eff} \le n_{O.b}$ values may trigger further 
fillings via RCCs. A more accurate approximation would require considering the 
filling history of the cache group, which seems to be impractical and is definitely 
beyond the rationale of our estimation model. Moreover, this case does not seem 
important enough to justify additional model complexity. Therefore, we always put 
up an inequality relation in formulas where $n_{O.b}$ is involved: 
$n_{eff} \le n_{O.b} = f(N_O, c_{O.b}, n_O)$. 

In case of a unique source column O.a of an RCC (i.e., a member constraint), 
all $n_{O.a} = n_O$ values are always new and lead to the loading of the 
corresponding values in the target column of the participating table (let’s say 
Q or V). Hence, $n_Q$ ($n_V$) records of type Q (V) are always to be filled into 
the cache table Q (V): $n_Q = n_O \cdot N_Q/c_{Q.e}$ ($n_V = n_O \cdot N_V/c_{V.g}$). In case of a 
non-unique source column O.b of an RCC (i.e., an OC or XC), all $n_{O.b} \le n_O$ 
values are assuredly new only when O is empty. In general, some of these values 
may already have been brought into the cache table by a previously referenced 
O.k value. Therefore, the resulting cache load for the target table P can only be 
estimated by $n_{P1} \le n_{O.b} \cdot N_P/c_{P.c} = f(N_O, c_{O.b}, n_O) \cdot N_P/c_{P.c}$. 

As indicated in Fig. 3, P is reached by an additional RCC loading path 
O.a → V.g, V.h → P.d, the contribution of which has to be approximated. 
According to our assumptions, $n_V$ records cause $d = f(N_V, c_{V.h}, n_V)$ 
different values to be loaded into V.h, which, in turn, need d different owner 
records in P. Because there are already records in P loaded via O.b → P.c, 
we may encounter some of these owner records there. Hence, we can 
expect $n_P = n_{P1} + n_{P2} - n_{P1} \cdot n_{P2}/N_P$ records to be loaded into P where 
$n_{P2} = f(N_V, c_{V.h}, n_V) \cdot N_P/c_{P.d} = f(N_V, c_{V.h}, n_V)$. 

4.4 General Scheme for a Single Evaluation Step 

After having discussed, by referring to the example in Fig. 3, the various param- 
eters influencing the filling of a cache group caused by a single cache key value, 
we can generalize our notation and summarize our findings. In the following, we 
use S and T for source and target table, or for a table in general. Compiled in 
Tab. 1, we have derived a general scheme for determining the cache table filling 
of a single evaluation step i. There we assess the effects of a single RCC R of 
a given type U/NU → U/NU.^ Initiated by $n_S$ records filled into S, the listed 
value $n_T$ is the expected size of the record set that is to be filled into T to satisfy 
RCC R. For the initial step (i = 1), $n_S$ is derived from the effective cache key 
of the root table. Note again, a target table becomes the source table of the 
subsequent evaluation step i + 1: $T_i \mapsto S_{i+1}$ and $n_{T_i} \mapsto n_{S_{i+1}}$. 

^ We assume a lossless join along an RCC. If, for example, an RCC connects a unique 
column of S with a unique column of T, then $N_S = N_T$, i.e., NULL values do not 
occur in these columns. 

Table 1. General scheme for the table population induced by an RCC 

target table filling $n_T$ 

source column S.i    target column T.j: U                  target column T.j: NU 
U                    $n_S \cdot 1$                         $n_S \cdot N_T/c_{T.j}$ 
NU, DC               $n_S \cdot c_{S.i}/N_S \cdot 1$       $n_S \cdot c_{S.i}/N_S \cdot N_T/c_{T.j}$ 
NU, non-DC           $f(N_S, c_{S.i}, n_S) \cdot 1$        $f(N_S, c_{S.i}, n_S) \cdot N_T/c_{T.j}$ 




Whether a non-unique column S.i becomes domain complete is context 
dependent (see Sect. 3.1). If it is domain complete, all its values cause new record 
sets of size $N_S/c_{S.i}$ to be loaded into S. 

5 Evaluation of Single Cache Groups 

So far, we have considered the effect of a single RCC on cache group filling. 
Note, since $n_S$ and $n_T$ are context dependent, it is not sufficient to sum the 
individual RCC filling results. Starting from an empty cache group G, our next 
goal is the estimation of G’s population induced by a single $ck_{eff}$ value. This 
estimation is an upper bound (in the average-case sense) for subsequent fillings 
due to a further $ck_{eff}$ value, because, in case of a NU, non-DC source column S.i 
of an RCC, some of the values estimated by f(N, c, n) may already reside in 
column S.i. Hence, only some of these values may lead to a filling activity in the 
target table. The effective set of values is usually smaller than estimated by our 
formula - a rare situation that, however, has to be taken into account due to 
limited model accuracy. 

We propose a population-estimation algorithm PE that refers to G’s 
reachability graph built from its applicable RCCs only; we assume that this graph is cycle 
free. PE starts at the given filling point and computes - once and for all - the 
number of records $n_S$ to be filled into the root table. Each member table T has 
$m \ge 1$ incoming RCCs originating from source tables $S_1, \ldots, S_m$. In order to 
avoid multiple evaluation of T’s outgoing RCCs (for each incoming RCC 
separately), we need the expected size $n_{T act}$ of T’s population based on all incoming 
RCCs, before we compute the populations of tables directly reachable from T.^ 
Furthermore, in order to compute T’s table population at once, we must know 
$n_{S_i act}$ of all source tables $S_i$. Since this rule applies to all $S_i$ in their role as 
a target table as well, we have to compute the individual table populations in 
cache group G in such an order that, for the estimation of each T, the estimated 
populations of all $S_i$ are already known. In other words, we have to perform a 
topological sort TS of G’s reachability graph^ to determine the order in which 
the table populations can be computed. Since only the root table of G lacks 
incoming arcs, it is the starting point of TS. 

^ Although G is assumed to be cycle-free, this is the reason why a single traversal (e. g. 
left-most depth-first) of G’s reachability graph is not sufficient for the population 
estimation. 

^ Since a topological sort detects cycles, it is a consistency check whether the PE 
algorithm is applicable to G’s reachability graph. 

1. Using its reachability graph, list the topological sort order of G’s tables in TSO. 

2. Visit the first (root) table S in TSO and compute the expected population $n_S$ 
   using the cardinality of $ck_{eff}$. 

3. While there are tables in TSO not visited, visit the next table T and obtain the 
   expected table population $n_T$: 

   a) For each incoming RCC $R_i = S_i.a \to T.b_i$, compute the population $n_{Ti}$ 
      using the already computed $n_{S_i act}$ of its source table $S_i$, the type of $R_i$, 
      and the cardinality of its target column $c_{T.b_i}$. 

   b) $n_{T act}$ is obtained by applying the combination-of-events model using all 
      determined $n_{Ti}$. 

Fig. 4. Algorithm PE, estimating the population of a cache group G caused by a 
reference to a single $ck_{eff}$ value. 


Knowing $n_{S_i}^{act}$ for all source tables $S_i$ connected to T via incoming RCCs, 
we can apply the appropriate formula of Tab. 1 and compute the population sizes 
$n_{T_i}$ ($i = 1, \ldots, m$) expected from each of the m RCCs. Since the corresponding 
record sets are considered stochastically independent, we can eliminate the expected 
duplicates from our population estimation by computing $n_T^{act} = n_{T_{1 \ldots m}}$ 
as sketched in Sect. 4.1. Figure 4 summarizes the steps of PE. 
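
The control flow of PE can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-RCC formulas of Tab. 1 and the combination-of-events model of Sect. 4.1 are not reproduced in this part of the text, so `per_rcc_estimate` is a hypothetical callback supplied by the caller, and `combine_independent` assumes the usual expected-union formula for independent record sets drawn from a table of known cardinality. All function and parameter names are assumptions made for illustration.

```python
from graphlib import TopologicalSorter, CycleError


def combine_independent(sizes, table_cardinality):
    """Expected size of the union of independent record sets (assumed model)."""
    miss = 1.0  # probability that a record is selected by none of the sets
    for n in sizes:
        miss *= 1.0 - min(n, table_cardinality) / table_cardinality
    return table_cardinality * (1.0 - miss)


def estimate_populations(root, root_population, rccs, table_cardinality,
                         per_rcc_estimate):
    """Estimate the population of every table of a cache group.

    rccs is a list of (source_table, target_table) pairs;
    per_rcc_estimate(source, target, source_population) is a hypothetical
    stand-in for the formulas of Tab. 1.
    """
    # Reachability graph as a map: table -> set of source tables.
    sources = {root: set()}
    for s, t in rccs:
        sources.setdefault(s, set())
        sources.setdefault(t, set()).add(s)

    # Step 1: topological sort; raises CycleError if the graph has a cycle.
    order = list(TopologicalSorter(sources).static_order())

    population = {}
    for table in order:
        if table == root:
            population[table] = root_population          # step 2
            continue
        # Step 3a: one estimate per incoming RCC, sources already known.
        parts = [per_rcc_estimate(s, table, population[s])
                 for s in sorted(sources[table])]
        # Step 3b: remove the expected duplicates.
        population[table] = combine_independent(parts, table_cardinality[table])
    return population
```

Because every table is visited only after all of its source tables, each population is computed exactly once, which is the point of sorting the reachability graph first.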

6 Conclusion and Future Work 

We have introduced constraint-based database caching using as an example a 
specific kind of cache groups tailored to PSJ queries, which frequently occur in 
TWAs. Cache groups provide predicate completeness for predicates built con- 
structively from simple base predicates, which are specified as parameterized 
constraints on cache tables. This use of parameters gives cache groups a simple 
kind of adaptability. 

The analysis of the basic type of cache groups has shown that one must 
be aware of the consequences of a set of specified cache constraints: On the 
one hand, performance problems due to uncontrolled cache loading must be 
prevented; on the other hand, one must know which kinds of predicates can 
be evaluated correctly in the cache and must have efficient probe operations to 
check the availability of predicate extensions. Furthermore, for each variation of 
constraint-based caching, quantitative analyses must help to understand which 
cache configurations are worth the effort. Therefore, we have developed the basic 
principles to quantitatively estimate the loading costs of a given cache group 
configuration. 

Our framework can be used for the design of a cache group advisor support- 
ing the DBA in the specification of a cache group, when the characteristics of the 
workload are known. Then, the expected costs for cache maintenance and the 
savings gained by predicate evaluation in the cache can be determined thereby 
identifying the trade-off point of cache operation. For example, starting with 
the cache tables and join paths exhibiting the highest degrees of reference lo- 
cality, the cache group design can be expanded by additional RCCs and tables 
until the optimum point of operation is reached. Such a tool may also be useful 
during cache operation by observing the workload patterns and by proposing 
or automatically invoking changes in the cache group specification. This kind 
of self-administration or self-tuning opens a new and complex area of research 
often referred to as autonomic computing. 

There are many other issues that wait to be resolved: For example, we have 
not said anything about the invalidation of predicates, about the removal of 
overlapping predicate extensions from the cache, or about different strategies 
for applying updates to cache and backend DB. We also want to explore 
how the idea of constraint-based caching can be extended to other types of 
predicates (e.g., range or aggregation predicates). 

Acknowledgments. We want to thank M. Altinel, Ch. Bornhovd, and C. Mo- 
Abstract. Nowadays, one of the main research issues of great interest 
is the efficient tracking of mobile objects that enables the effective 
answering of spatiotemporal queries. This line of research is relevant 
to a number of modern applications spanning many contexts. In this 
paper, we consider the organization of a moving object database by 
quadtree-based structures (structures obeying the Embedding Space 
Hierarchy). In this context, we adapt an indexing method, called XBR 
trees, to support range queries about the history of trajectories of 
moving objects. The XBR tree is a quadtree-like, external memory, 
balanced and compact structure that follows regular decomposition. 
Apart from the presentation of the new method, we experimentally 
show that it outperforms the only previous Embedding Space Hierarchy 
approach (based on PMR quadtrees) for indexing moving objects. 

Keywords: Quadtrees, Moving Objects, Spatiotemporal Queries, Spa- 
tiotemporal Databases 



1 Introduction 

In the past few years, the focus of the research in Geographic Information Sys- 
tems (GISs) has drastically evolved from traditional data management issues 
(such as modeling, indexing, querying) to new and exciting challenges raised 
by the emergence of new technologies. Two of the major recent achievements of 
these technologies, namely the World Wide Web and the development of accurate 
positioning systems, have a strong impact on GISs. In particular, positioning sys- 
tems constitute a very challenging area. The Global Positioning System (GPS) 
and the new European Galileo satellite project (its launching has been decided 
very recently, at the end of March 2002), are able to determine the position of a 
moving object with a very high precision (e.g. a few centimeters). 
On the other hand, numerous applications related to moving objects are 
undoubtedly needed. Technologies involving mobile computing have 
evolved greatly, particularly in the last few years. Devices 
such as mobile phones and Internet terminals have become ubiquitous. 

There are also applications, which include vehicle navigation, tracking and 
monitoring, where the positions of air, sea or land-based equipment, such as 
airplanes, fishing boats and cars (e.g. taxis or ambulances) are of interest. An 
example of such applications is the tracking of fighter planes in air-force combat 
situations. Being able to correctly locate the planes (that move very fast) at the 
present time and in the near future can be used to avoid enemy targets and 
also to guide the fighter planes towards proper targets. Other real-life examples 
that involve objects with positions changing over time are traffic control, fleet 
management, fire or hurricane front monitoring and weather forecasting. 

The topic of querying and indexing moving objects has been addressed by 
several researchers. As far as the theoretical background is concerned, Sistla et al. 
in [11] proposed a data model, called Moving Objects Spatio-Temporal (MOST) 
model, for representing moving objects and a query language, called Future Tem- 
poral Logic. Wolfson et al. [18] addressed the uncertainty issues, which determine 
the frequency with which the database has to update the locations of the moving 
objects, in order to provide an error bound. 

Several papers have appeared that base the indexing of moving objects on 
structures that belong in the family of R-trees [4]. For example, in [10] Saltenis 
et al. proposed an R*-tree based access method (the TPR-tree) to index the 
current and future locations of moving objects aiming at handling range queries. 
Pfoser et al. [9] proposed the STR-tree as an R-tree based indexing scheme suit- 
able for storing the history of moving objects and for trajectory-based queries. 
Furthermore, the Historical R-tree was proposed by Nascimento et al. [8] as an 
indexing method for spatiotemporal data and range queries. Finally, in [19], Zhu 
et al. proposed octagon trees (OT-tree, O-tree) as extensions to the R*-tree to 
index moving objects and handle range queries. 

All these methods are based on the concept of Object Space Hierarchy (the 
partitioning of regions depends on the data) that is followed by structures of 
the R-tree family. In this paper, we focus on methods based on the concept 
of Embedding Space Hierarchy (the partitioning of regions follows a regular 
fashion) that is followed by structures of the quadtree family. To the authors’ 
knowledge, the only paper that addresses the problem of indexing moving points 
by such a method is [12]. In the present paper, a new such technique 
is presented and compared to the method of [12]. 

These structures allow the processing of range queries in time and space (e.g. which 
objects will appear in a specific area within a given time interval), the prediction of 
the future position of an object, or the tracing of the history of the movement of an 
object. 

An alternative perspective to tackle the issue of moving objects is the use 
of transformations to index their trajectories. In [6] Kollios et al. used the dual 
transformation with a view to improve the performance during range queries. 
Similarly, Chon et al. [3] proposed the SV-model as an alternative method of 
transformation. The use of moving objects can also be applied in multimedia 
environments. For example, Tzouramanis et al. in [14,15,16] presented several 
spatiotemporal access methods (i.e. the OLQ-trees, Overlapping Linear Quadtrees, 
and the MVLQ-trees, Multiversion Linear Quadtrees) for storing and retrieving 
evolving raster images. 

In [5], Hadjieleftheriou et al. suggested the Partially Persistent R-tree (PPR-tree) 
as a method for indexing and querying the history of moving objects with changing 
extent (e.g. shrinking). Furthermore, the object movement was described by 
polynomial and not by linear functions and the queries examined were range 
ones. Finally, other researchers proposed the use of techniques rooted in 
computational geometry (for example, in [1] external Range Trees are presented and 
used for indexing moving points). 

The indexing scheme that we propose here, the External Balanced Regular 
trees (XBR trees), is based on quadtrees, and more specifically on hierarchical 
and regular subdivision of space. The key ideas behind its design were originally 
presented in [17] for managing spatial objects, in general. In this paper, we use 
XBR-trees in the context of spatiotemporal databases. More specifically, we use 
XBR-trees to index the trajectories of moving objects and to answer 
spatiotemporal queries about these objects. In addition to the material appearing 
in [17], in this paper, we present a modified algorithm for splitting internal 
nodes of XBR-trees to deal with extreme conditions and we also describe the 
steps of the deletion process in XBR-trees. 

We experimentally compare the resulting method (that could be used as the 
physical layer of a Moving Objects Database) with the only analogous (quadtree 
based) method that is based on the PMR quadtree and was presented in [12] by 
Tayeb et al. An important difference between the two techniques is that the indexing 
part of the PMR resides in main memory, whereas the indexing part of XBR 
trees is a multiway disk-based tree. However, the experiments conducted in the 
present paper cannot be compared directly with the ones presented in [12], since 
they are performed under completely different conditions and assumptions (in 
[12] only the present status of moving objects is maintained, while in this paper 
the trajectory of each object through time is kept). 

The XBR trees constitute a family of new secondary memory structures, 
which are suitable for storing and indexing multi-dimensional points and line 
segments. In 2 dimensions, the resulting structure is an External Balanced 
Quadtree, in 3 dimensions an External Balanced Octree, and in higher dimensions 
an External Balanced Hyper Quadtree. The main characteristic of all these 
structures is that they subdivide space (in a hierarchical and regular fashion) 
into disjoint regions. These spatial access methods are fully dynamic, while in- 
sertions are not complicated to program as they affect a single tree path only. 
Moreover, XBR trees are variable resolution structures. That is, the number of 
space subdivisions is not predefined, making these structures suitable for very 
large amounts of data. Due to the balanced nature of these structures and the 
disjointness of the resulting regions, searches and other queries in these trees 
are processed very efficiently. The interested reader will find a short qualitative 
comparison of XBR-trees with other well known structures, such as R-trees, 
R+-trees, hB-trees and GBD trees, in [17]. 

The rest of the paper is organized as follows. Section 2 describes the assump- 
tions made with respect to the movement of the objects. Section 3 gives a detailed 
description of the new structure and a short description of PMR quadtrees. Sec- 
tion 4 exposes the experimental results as far as query performance of the two 
trees is concerned. Finally, Section 5 presents briefly the conclusions and further 
research directions. 



2 Monitoring of Moving Objects 

We assume that time is discrete and that the location and velocity of each object 
is updated only at predefined time points that divide time in a number of time 
intervals. For each time interval of the past (up to the current time point), a 
line segment that expresses the movement of the object during this interval is 
maintained. For the interval starting at the current time point, a line segment 
that expresses the initial location and velocity of the object is maintained. All 
these line segments make up a polyline that expresses the trajectory of the 
object from the starting time point to the point that follows the current time 
point. In particular, the last line segment expresses not the actual trajectory, but 
the expected trajectory from the current time point to the next one. 

When time advances to the next time point, each object notifies the sys- 
tem of its actual location and velocity. With this data, the last line segment of 
the polyline is updated (meaning that, in general, the last line segment must 
be deleted and reinserted to reflect the actual data) and a new segment that 
expresses the expected trajectory from the new current time point to the next 
one is inserted. The resulting line segments are stored in the (XBR, or PMR) 
tree leaves and information guiding the search to the leaves is stored in internal 
nodes. 

This scheme aims at efficiently supporting range queries regarding the history 
of the objects’ movement. For example, to answer the query ‘Find all the objects 
that were positioned inside a particular area during a specific time interval’, we 
traverse the tree from the root, visiting only the nodes which may contain object 
trajectories satisfying the query. This is done by comparing the area coordinates 
specified by the query to the coordinates specifying each node. 

Although it is possible to handle the X and Y coordinates of each object (along 
with time) in the same structure (with tree versions that can handle 3-dimensional 
data), following the approach of [12] we handle the X and Y coordinates 
independently. This means that we keep one 2-dimensional tree for the X coordinate 
along with time and another 2-dimensional tree for the Y coordinate along with time. 
We answer a query using each of the trees and then combine the subanswers. 
Accordingly, at each time point we update both trees. 
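
The two-tree scheme can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: plain lists of `(object_id, segment)` pairs stand in for the two trees, all function names are hypothetical, and the subanswers are combined by intersecting the sets of object identifiers, which the text does not spell out. Note that such an intersection acts as a filter: an object may satisfy the X condition and the Y condition at different instants of the query interval.

```python
def segment_hits_window(segment, t_range, c_range):
    """Test a line segment in the (time, coordinate) plane against a window.

    segment = ((t0, c0), (t1, c1)) with t0 < t1; the object moves linearly.
    """
    (t0, c0), (t1, c1) = segment
    # Clip the segment to the query time interval.
    t_lo, t_hi = max(t0, t_range[0]), min(t1, t_range[1])
    if t_lo > t_hi:
        return False
    slope = (c1 - c0) / (t1 - t0)
    a = c0 + slope * (t_lo - t0)
    b = c0 + slope * (t_hi - t0)
    # The motion is linear, so the coordinate stays between a and b.
    return min(a, b) <= c_range[1] and max(a, b) >= c_range[0]


def range_query(index, t_range, c_range):
    """index: iterable of (object_id, segment); a list stands in for a tree."""
    return {oid for oid, seg in index
            if segment_hits_window(seg, t_range, c_range)}


def combined_query(x_index, y_index, t_range, x_range, y_range):
    """Answer the query on each structure and intersect the subanswers."""
    return (range_query(x_index, t_range, x_range)
            & range_query(y_index, t_range, y_range))
```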
3 XBR and PMR 

3.1 The XBR Tree 

The XBR tree consists of two kinds of nodes: the leaves that occupy disk pages 
containing the actual data, namely the line segments, and the internal nodes, 
also occupying disk pages, which provide a multiway indexing method. 

Although the XBR tree is an indexing method that can be defined 
for various dimensions, for the sake of brevity we assume two 
dimensions in the sequel. For 2 dimensions the hierarchical decomposition of space 
is the same as in quadtrees. More specifically, the space is subdivided in 4 equal 
subquadrants, any of which may be further subdivided recursively in 4 subquadrants. 

Internal Nodes. Each internal node in the XBR tree consists of a non-pre- 
defined number of pairs of the form <address, pointer>. The number of these 
pairs is non-predefined because the addresses being used are of variable size. An 
address expresses a child node region and is paired with the pointer to this child 
node. Clearly, both the size of an address and the total space occupied by 
all pairs within a node must not exceed the node size. 

More specifically, the address encoding method that we used works as follows. 
For a binary integer $x$, we initially form the code $\gamma$, which consists of two parts. The 
first part has $\lfloor \log_2 x \rfloor$ 0s and one 1, while the second part is the number 
$x - 2^{\lfloor \log_2 x \rfloor}$ in binary form, expressed with $\lfloor \log_2 x \rfloor$ bits. The code that we 
finally use is $\delta$: its first part encodes the number $\lfloor \log_2 x \rfloor + 1$ in code $\gamma$ 
(with the two parts of $\gamma$ concatenated) and its second part is the same as that of 
code $\gamma$ (the number $x - 2^{\lfloor \log_2 x \rfloor}$ in binary form). More details appear in [17]. 
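
The two codes described above can be written down in a few lines of Python. This is a sketch of the encoding as stated in the paragraph, with codes returned as strings of '0' and '1' for readability; the function names are assumptions, and how an address of directional digits is mapped to the integer $x$ is described in [17] and is not reproduced here.

```python
def gamma_code(x):
    """Code gamma: floor(log2 x) zeros, one 1, then x - 2^floor(log2 x)
    written in binary with floor(log2 x) bits."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    n = x.bit_length() - 1                 # floor(log2 x)
    remainder = x - (1 << n)
    second = format(remainder, "b").zfill(n) if n else ""
    return "0" * n + "1" + second


def delta_code(x):
    """Code delta: floor(log2 x) + 1 written in code gamma, followed by the
    same second part as in code gamma."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    n = x.bit_length() - 1
    remainder = x - (1 << n)
    second = format(remainder, "b").zfill(n) if n else ""
    return gamma_code(n + 1) + second
```

For larger values, the second code spends fewer bits on the length prefix than the first one, which matters when addresses are long.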

The addresses being used constitute a representation of a specific subquad- 
rant being produced by quadtree-like hierarchical subdivision of the current 
space. Each address is formed by a number of directional digits each one repre- 
senting a particular subquadrant. That is, NW, NE, SW and SE subquadrants of 
a quadrant are distinguished by the directional digits 0,1,2 and 3, respectively. 
For example, the address 1 represents the NE quadrant of the current space, 
while the address 10 the NW subquadrant of the NE quadrant of the current 
space. 
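
As an illustration of how an address selects a subquadrant, the following Python sketch decodes a string of directional digits into a rectangle. The coordinate conventions are assumptions made for the example: the current space is the unit square given as `(x_min, y_min, x_max, y_max)`, north is the direction of increasing y, and the terminating * of an address is omitted.

```python
def address_to_region(address, region=(0.0, 0.0, 1.0, 1.0)):
    """Return the subquadrant of `region` selected by a string of
    directional digits (0 = NW, 1 = NE, 2 = SW, 3 = SE)."""
    x_min, y_min, x_max, y_max = region
    for digit in address:
        if digit not in ("0", "1", "2", "3"):
            raise ValueError("invalid directional digit: %r" % digit)
        x_mid = (x_min + x_max) / 2
        y_mid = (y_min + y_max) / 2
        if digit in ("0", "2"):    # western half
            x_max = x_mid
        else:                      # eastern half
            x_min = x_mid
        if digit in ("0", "1"):    # northern half
            y_min = y_mid
        else:                      # southern half
            y_max = y_mid
    return (x_min, y_min, x_max, y_max)
```

With these conventions, the address 1 yields the NE quadrant and the address 10 the NW subquadrant of that quadrant, as in the example above; the empty address yields the whole current space.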

One of the main novelties of this particular indexing scheme is the fact that 
in reality the region of a child is the subquadrant determined by the address 
in its pair minus the subquadrants corresponding to the previous pairs of the 
internal node to which it belongs. 

For example, Figure 1 depicts an internal node that points to two leaves. 
While the left child region is the SW quadrant of the original space, the right 
child region is the whole space minus the region of the first quadrant. Each * 
symbol denotes the end of a variable size address. In particular, the address of 
the left child is 2*, where the directional digit 2 corresponds to the SW quadrant 
of the original space. Moreover, the right child address is * (i.e. no directional 
digits exist in this address) and the region for this child is the whole space minus 
the first child region. Each address refers to a minimal quadrant covering the 
internal node. In this specific example, the minimal subquadrant is the whole 
space, since the internal node under consideration is the root. 

Fig. 1. An XBR tree with one internal node and two leaves. 

When a search or an insertion of a line segment is performed, descending the 
tree from the root specifies the appropriate leaves and their regions. At the root, 
the region that has to be checked is the whole space. When visiting an internal 
node, we check in turn every contained pair. The first pair with a subquadrant 
that contains the particular coordinates is chosen and its pointer to the next 
level is followed. By examining this way the pairs in each node, the path being 
followed determines the region under consideration by intersecting it with the 
subquadrant of the chosen pair and subtracting the subquadrants of the pairs 
appearing to the left of this pair. 




Fig. 2. An XBR tree with two levels of internal nodes. 



After an insertion in the left child, this child is split and the result is shown 
in Figure 2. The child split has caused the split of the internal node too, and 
this has led to the creation of a new root. To locate the data element 
marked with ‘x’, we proceed as follows. First, we visit the root and examine 
the pairs that it contains. The address 2*, which belongs to the first pair, is 
the one whose subquadrant contains the coordinates of ‘x’. Therefore, we 
follow the pointer of the first pair. In this 
node, the address of the first pair, 2*, determines the SW subquadrant of the 
SW quadrant of the whole space, which does not contain the coordinates of ‘x’, 
and the address of the second pair, *, determines the rest of the SW quadrant 
of the whole space, which contains the coordinates of ‘x’. We follow the pointer 
of this pair and reach the leaf containing ‘x’. 
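
The descent just described can be sketched in Python as follows. The representation is an assumption made for illustration: an internal node is an ordered list of `(address, child)` pairs, a leaf is any non-list value, addresses are relative to the minimal quadrant of the node that holds them (taken here to be the subquadrant selected in the parent), and the terminating * is omitted. The first pair whose subquadrant contains the point is followed, so the subquadrants of earlier pairs are subtracted implicitly.

```python
def address_to_region(address, region):
    """Subquadrant selected by directional digits (0 NW, 1 NE, 2 SW, 3 SE)."""
    x_min, y_min, x_max, y_max = region
    for digit in address:
        x_mid, y_mid = (x_min + x_max) / 2, (y_min + y_max) / 2
        if digit in ("0", "2"):
            x_max = x_mid
        else:
            x_min = x_mid
        if digit in ("0", "1"):
            y_min = y_mid
        else:
            y_max = y_mid
    return (x_min, y_min, x_max, y_max)


def contains(region, point):
    x_min, y_min, x_max, y_max = region
    return x_min <= point[0] < x_max and y_min <= point[1] < y_max


def find_leaf(node, point, quadrant=(0.0, 0.0, 1.0, 1.0)):
    """Descend from `node` to the leaf whose region contains `point`."""
    while isinstance(node, list):              # internal node
        for address, child in node:            # check every pair in turn
            sub = address_to_region(address, quadrant)
            if contains(sub, point):           # first matching pair wins
                node, quadrant = child, sub
                break
        else:
            raise LookupError("no pair covers the point")
    return node


# A tree shaped like the one of Figure 2 (leaf names are hypothetical).
root = [("2", [("2", "leaf_sw_of_sw"), ("", "leaf_rest_of_sw")]),
        ("", "leaf_rest_of_space")]
```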

The multi-way nature of the XBR tree is not explicitly depicted by the two 
examples presented previously. This is done on purpose, only for the sake of 
brevity. 



Leaf Nodes. Each external node (leaf) in the XBR tree may contain a number 
of line segments, which is limited by a predefined capacity C. When an insertion 
causes the number of line segments of a particular leaf to exceed C, the 
leaf is split following a hierarchical decomposition analogous to the quadtree 
decomposition, until the resulting regions contain fewer than $x \times C$ 
and more than $(1 - x) \times C$ line segments, where $0.5 < x < 1$. 

The constant x is chosen in order to affect the number of necessary subdi- 
visions and the size of addresses that are created after the split. A choice of 
a value close to 0.5 has proven to cause more subdivisions and larger sizes of 
addresses. This is due to the fact that hardly ever does a partition split the leaf 
in subregions that contain almost equal numbers of elements. 

With such a value assigned to x, we can achieve a better guarantee as far as 
the space occupancy of leaves is concerned. On the contrary, the PMR quadtree 
partitions each leaf once and only once. Such a method requires overflow pages. 
This is not the case for the XBR leaf however, which is guaranteed to contain 
at most C line segments plus the number of directional digits needed to reach 
this leaf. Furthermore, in PMR quadtrees there is no minimum occupancy of 
leaf nodes. 

If we consider that x is assigned the value 0.75, then after continuous insertions 
of line segments in the NW corner of the right leaf region of Figure 2, the 
region of this leaf splits in four. If none of the subregions formed contains fewer 
than 3C/4 and more than C/4 line segments, then the subregion containing the 
largest number of data elements is split in four. This procedure is applied 
repeatedly until there exists a region with fewer than 3C/4 and more than C/4 line 
segments. Then the original leaf is split in two leaves: the subregion created 
above will represent the region of the left of the two resulting leaves. The rest of 
the original region is the new right leaf region. Following this policy, both leaves 
created will be at least 1/4 full. This situation is depicted in Figure 3. 
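
The splitting policy can be sketched in Python as shown below. To keep the example short it makes simplifying assumptions that are not part of the paper: data elements are points instead of line segments (so no element crosses a subdivision boundary), the space is the unit square with north as increasing y, and `max_depth` is a hypothetical guard against endless subdivision.

```python
def quadrant_digit(point, region):
    """Directional digit (0 NW, 1 NE, 2 SW, 3 SE) of `point` in `region`."""
    x_min, y_min, x_max, y_max = region
    west = point[0] < (x_min + x_max) / 2
    north = point[1] >= (y_min + y_max) / 2
    if north:
        return "0" if west else "1"
    return "2" if west else "3"


def child_region(region, digit):
    x_min, y_min, x_max, y_max = region
    x_mid, y_mid = (x_min + x_max) / 2, (y_min + y_max) / 2
    if digit in ("0", "2"):
        x_max = x_mid
    else:
        x_min = x_mid
    if digit in ("0", "1"):
        y_min = y_mid
    else:
        y_max = y_mid
    return (x_min, y_min, x_max, y_max)


def split_leaf(points, capacity, x=0.75, region=(0.0, 0.0, 1.0, 1.0),
               max_depth=32):
    """Split an overflowing leaf; return (address, left_leaf, right_leaf)."""
    low, high = (1 - x) * capacity, x * capacity
    address, inside = "", list(points)
    for _ in range(max_depth):
        groups = {d: [] for d in "0123"}
        for p in inside:
            groups[quadrant_digit(p, region)].append(p)
        balanced = [d for d in "0123" if low < len(groups[d]) < high]
        # Take a balanced subregion if one exists, else the fullest one.
        digit = balanced[0] if balanced else max(
            "0123", key=lambda d: len(groups[d]))
        address += digit
        region = child_region(region, digit)
        inside = groups[digit]
        if balanced:
            remaining = list(points)
            for p in inside:
                remaining.remove(p)
            return address, inside, remaining
    raise ValueError("no balanced subregion found within max_depth")
```

The returned address is the one that would be placed in the new left pair of the parent node; the right leaf keeps the rest of the original region.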



Splitting of Internal Nodes. An internal node overflow causes a split in two 
in a way that achieves a good balance between the space use in the two nodes. 
In order to perform the split, first of all we construct a quadtree that has as 
nodes the quadrants existing in the XBR internal node to be split. This is shown 
in Figure 4a. This node contains addresses that subdivide the node region. Each 
address corresponds to a quadtree node represented by a square. By following 
the path to this node, all intermediate quadtree nodes are marked as circles. 
There also exists the possibility that a square may be the ancestor of other 
squares (for example, the square of address 0* is a parent for the squares of 
addresses 00*, 01* and 02*). The address * specifies the quadtree root. 
Fig. 3. The XBR tree after splitting the rightmost leaf of the tree in Figure 2. 



To each node, we assign the number of squares that will be freed when 
we eliminate the subtree rooted at this node. A bottom-up procedure rapidly 
calculates this number. In Figure 4c each external square is assigned the value 
1: the squares of 100*, 101*, 00*, 01* and 02* are all assigned the value 1. Each 
internal square is assigned the sum of values of its children plus 1. For example, 
the square of 0* is assigned the value 4 = 1 + 1 + 1 + 1. Finally, a circle is assigned 
the sum of values of its children only. For example, the second root child is 
assigned 2, since it has only one child (another circle) with value 2. Next we 
traverse the tree in order to find a node, apart from the root, which is a square 
and is assigned the largest number of squares in the tree. 



[Figure 4, panels (a) to (e). Entries recoverable from the extraction: the node 
before the split holds 00*, 01*, 02*, 0*, 100*, 101*, *; panel (d) holds 
0*, 1*, 2*, *; panel (e) holds 100*, 101*, *.] 

Fig. 4. Splitting of an internal XBR tree node. 
For example, in the tree of Figure 4c, the node assigned the largest 
number of squares is the leftmost root child, with address ‘0’ and number of 
squares equal to 4. The sought subtree is rooted at this node. The resulting two 
nodes are depicted in Figures 4d and 4e. Clearly, in the parent node of the 
original internal node, the entry 0*, which corresponds to the minimal quadblock 
of the left of the resulting nodes, should be inserted. 
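
The selection of the subtree to be detached can be sketched compactly in Python. The sketch relies on an observation rather than on the quadtree construction itself: the number of squares freed by eliminating the subtree rooted at the square of address a equals the number of addresses in the node that have a as a prefix. Addresses are given as strings without the terminating *, the function name is hypothetical, and ties are resolved in favour of the first candidate, which the text does not specify.

```python
def split_internal(addresses):
    """Split the address list of an overflowing internal node.

    Returns (entry_for_parent, left_node, right_node); the addresses of the
    left node are rewritten relative to its new minimal quadrant.
    """
    def squares_freed(a):
        # Squares in the subtree rooted at a = addresses having prefix a.
        return sum(1 for b in addresses if b.startswith(a))

    candidates = [a for a in addresses if a != ""]   # the root is excluded
    best = max(candidates, key=squares_freed)
    left = [a[len(best):] for a in addresses if a.startswith(best)]
    right = [a for a in addresses if not a.startswith(best)]
    return best, left, right
```

Applied to the entries 00, 01, 02, 0, 100, 101 and the empty address, the sketch selects the square of address 0 and yields the entries 0, 1, 2, * for one node and 100, 101, * for the other, in agreement with the entries listed for Figures 4d and 4e.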



Deletion. Deletion is used while updating the location and velocity of each 
object at each time point. That is, the last line segment of the trajectory of each 
moving object is updated at the end of each time interval (in general, it must 
be deleted and reinserted to reflect the actual data) and a new line segment, 
expressing the expected trajectory from the new current to the next time point, 
is inserted. 

Since a line segment may cross the regions of several XBR-tree leaf nodes, 
it has to be removed from all these leaf nodes. Following a procedure similar to 
the insertion of a line segment, at each internal node (starting from the root) 
we sequentially examine the <address, pointer> pairs and recursively visit child 
nodes with regions that are crossed by the line segment. This way, we determine 
all the leaves that are crossed by the line segment (the line segment must be 
deleted from each of these leaves). 

If a leaf node from which we remove the line segment underflows (if it contains 
fewer than $(1 - x) \times C$ line segments), then a merge occurs. First, the <address, 
pointer> pair corresponding to this leaf that resides in its parent internal node is 
deleted. Then, the remaining line segments of the leaf node are added to the rightmost 
child of the parent internal node (the rightmost brother of the leaf node). If this 
child overflows, then it is split (as described in the “Leaf Nodes” part of the 
current subsection) and the split may propagate to higher levels (hosting internal 
nodes). 
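
The merge step for a single leaf can be sketched in Python as follows. The representation is hypothetical: the parent is a list of `(address, leaf)` pairs and each leaf is a list of segments. The text does not say what happens when the underflowing leaf is itself the rightmost child, so the sketch simply leaves that case untouched; the split of an overflowing rightmost leaf is only signalled, not performed.

```python
def delete_from_leaf(parent, index, segment, capacity, x=0.75):
    """Remove `segment` from the leaf at parent[index] and merge on underflow.

    Returns True if the rightmost leaf overflowed and must be split.
    """
    _address, leaf = parent[index]
    leaf.remove(segment)
    is_rightmost = index == len(parent) - 1
    if len(leaf) < (1 - x) * capacity and not is_rightmost:
        del parent[index]              # drop the <address, pointer> pair
        rightmost = parent[-1][1]
        rightmost.extend(leaf)         # move the remaining line segments
        return len(rightmost) > capacity
    return False
```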

Since internal nodes do not have a minimum occupancy threshold, the merge 
process is not applied to internal nodes. A more sophisticated deletion process 
that considers merging an underflowed leaf node with brother leaves other 
than its rightmost brother, as well as merging of internal nodes, is currently 
under development. 

3.2 The PMR Tree 

The PMR tree [12] is an indexing scheme based on quadtrees, capable of index- 
ing line segments. The internal part of the tree consists of an ordinary region 
quadtree (degree four tree) residing in main memory. The leaf nodes of this 
quadtree point to the bucket pages that hold the actual line segments and reside 
on disk (Table 1). Each line segment is stored in every bucket whose quadrant 
(region) it crosses. A line segment can cross a region of a bucket either fully or 
partially. The PMR tree was proposed as an access method where the index is 
kept in main memory, whereas the indexed data, namely the line segments, are 
stored in secondary storage. 
Table 1. RAM and Disk usage for XBR-trees and PMR-trees. 

        XBR-trees                            PMR-trees 
RAM     Pointer to Root                      Internal Nodes 
Disk    Internal Nodes and External Nodes    External Nodes 



Insertion in the PMR tree. A line segment is inserted in a PMR tree by 
being registered in the buckets that correspond to the quadrants that it crosses. 
During that procedure the capacity of each bucket that is intersected by the 
line segment is checked in order to verify whether that insertion causes it to 
exceed the predefined bucket capacity. If the bucket capacity is exceeded, then 
the bucket is split once and only once into four equal quadrants (if the bucket has 
already been split, then a chain of overflow buckets is maintained). Therefore, 
the bucket capacity is a split threshold. When a bucket is split, four new buckets 
are created, each one corresponding to a single subquadrant of the quadrant of 
the original bucket. After this procedure is performed, the old parent bucket 
is no longer in use. Instead, the quadtree pointer (in main memory) 
that used to point to that bucket now points to a new quadtree node with four 
pointers that point to the four newly created buckets. 



Deletion in the PMR tree. A line segment is deleted from a PMR quadtree by 
being removed from all the buckets that correspond to quadrants that it crosses. 
During this procedure, the occupancy of the bucket and its siblings is checked 
in order to discover whether the deletion causes the total number of line segments in 
them to fall below the split threshold. If it does, then they merge and the merge 
procedure is then repeated at the parent quadtree node. 

4 Experimentation 

For the tree implementations and the execution of the experiments we used a Pentium 
1600 MHz with 1024K memory. The page size used was 4K, which resulted in a 
leaf node containing 204 lines. After experimentation, we came to the conclusion 
that the use of a buffer of 100K with least-recently-used page replacement 
performs better than other choices. Therefore, except for 
these 100 disk pages, the entire index, comprising both the internal and the 
external nodes, resides on disk. 
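
A buffer with least-recently-used page replacement, together with a counter of the pages that have to be read from disk, can be sketched in Python as follows. The class and attribute names are assumptions made for illustration and the buffer size is a parameter expressed in pages; this is not the code used for the experiments.

```python
from collections import OrderedDict


class LRUBuffer:
    """Page buffer with least-recently-used replacement."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()     # page_id -> None, oldest first
        self.disk_reads = 0            # pages not found in the buffer

    def access(self, page_id):
        """Touch a page; return True on a buffer hit, False on a disk read."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)     # now the most recently used
            return True
        self.disk_reads += 1
        self.pages[page_id] = None
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)      # evict least recently used
        return False
```

Routing every node visit of a query through `access` gives the number of disk accesses as the value of `disk_reads` after the query.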

For the execution of the experiments, we considered 1000 time units, 
separated into 100 equal time intervals, each one of 10 time units. We conducted 
several experiments with a varying number of moving objects N, and a different 
size for the range query, which is successively set to 0.1, 0.01, 0.001 of the total 
space under consideration. At time unit 0, we randomly generate a velocity and 
an initial location for each object. We assume that during each time interval 
the object velocity and location are constant. At the interval end though, these 
numbers are updated for each object. This procedure is applied repeatedly until 
the end of the time horizon being considered is reached. 

The queries performed are range queries and during an experimental execution 
they are repeated after 10 constant time intervals. During the execution of the 
experiments we count the I/O cost for the queries, the cost for the pages that are 
not found in the buffer and are read from the disk, the execution time cost, the 
average number of repetitions of each line and the number of nodes that reside 
on disk for the XBR tree and the number of nodes that reside either on disk or 
in memory for the PMR tree. 

Since both the XBR tree and the PMR tree belong to the quadtree family, 
a line segment inserted in them may be kept not in a single leaf but in more than 
one leaf. Therefore, counting and comparing the number of appearances of a 
line segment in the two trees is considered noteworthy. 




[Figure 5: three plots of disk accesses; x-axis: number of objects (100 to 800).] 

Fig. 5. Disk Accesses for queries with range 0.1 (top left), 0.01 (top right) and 0.001 
(bottom). 

The first three experiments presented in Figure 5 study the number of disk 
accesses. Namely, we counted the number of disk accesses that were required 
for both trees during the execution of the range queries. In each experiment the 
parameters are the number of objects and the size of the range query. The objects 
vary from 100 to 800, whereas the query size takes values 0.1, 0.01 and 0.001. 
Figure 5 top left illustrates the number of disk accesses for range queries 
performed for a query size equal to 0.1 of the whole space under consideration. 
In this figure, the XBR tree requires significantly fewer disk accesses than the 
PMR tree. Figure 5 top right depicts the number of disk accesses for a query size 
equal to 0.01. As in the previous experiment, the number of disk accesses made 
by the XBR tree is again significantly smaller than that made by the PMR tree. 
Figure 5 bottom presents the number of disk accesses for the 0.001 query size. 
In this case, the PMR tree requires fewer disk accesses during the execution 
of the range queries. However, its difference from the XBR tree is very small 
(notice the scale on the y-axis). This reverse behavior is easily explained. For 
a very small query size, a small number of leaves is accessed. In the XBR tree 
(unlike the PMR tree), the internal nodes that are used to reach the leaves also 
reside on disk and contribute to the number of disk accesses. This is the penalty 
the XBR tree has to pay in order to be able to handle very large amounts of 
data (unlike the semi-RAM-based PMR quadtree). As will be shown later in 
this section, in the experiments that study execution time, even in this 
situation the XBR tree outperforms the PMR quadtree. 



[Plots omitted. Axis label: Number of nodes.] 





Fig. 6. Number of Nodes (left) and Medium Number of Repetitions Per Line (right). 



The next two experiments, presented in Figure 6, study the number of nodes 
required by both trees and the average number of appearances of the lines 
inserted. In each experiment, the parameter is the number of objects under 
consideration, which varies from 100 to 800. 

Figure 6 left shows (in thousands) the number of nodes required by the two 
trees. As the experiment progressed, the PMR tree grew and was made up 
of nodes that resided either on disk or in main memory. By definition, all the 
XBR tree nodes reside on disk. The PMR tree had far more nodes than 
the XBR tree. This means that the XBR tree (due to its 
multiway nature) has a smaller height than the PMR tree, which helps it 
answer the queries more efficiently. 

Since both the PMR tree and the XBR tree are quadtrees (and subdivide 
space in a predefined manner), it follows that the lines inserted in them are 
not kept in a single leaf. Each line segment may intersect and be inserted in 
several leaves, a fact that can delay the query processing. The average number 
of appearances per line inserted is presented in Figure 6 right. The parameter 
in this experiment is the number of objects, which again varies from 100 to 800. 
In this experiment, the XBR tree again stored each line segment in fewer leaves 
than the PMR tree. 

To sum up the results of the experiments in Figure 6, we conclude that the 
XBR tree is a more compact tree, with smaller height and fewer nodes 
than the PMR tree. Furthermore, the lines inserted in the XBR tree are not 
repeated as many times as in the PMR tree. This second result, namely that the 
inserted lines are repeated more times in the PMR tree, follows 
from the fact that the PMR tree has more nodes, so the lines 
inserted in it have to be repeated in more nodes. 






Fig. 7. Elapsed time for queries with range 0.1 (top left), range 0.01 (top right) and 
range 0.001 (bottom). 



The next three experiments study the time elapsed during each query execution 
(execution time cost). For each tree, we performed 10 queries, each one 
during a constant time interval. The parameters in these experiments are the 
query number (from 1 to 10), the query range (0.1, 0.01 
and 0.001), and the number of objects (500, 700 and 500, respectively). Since 
10 queries are performed in each experiment, each figure shows 10 
values of elapsed time. 

In Figure 7 top left the experiment was conducted with 500 moving objects 
and a query size equal to 0.1. In this figure, the execution time for the XBR tree 
is significantly less than the time for the PMR tree. Another observation 
from this experiment is that the time spent by the PMR tree shows 
a sudden, large increase during queries 5 to 8. This instability of the 
PMR tree shows that there is no guarantee on the time needed, which can be 
either small or large. In Figure 7 top right, 700 moving objects are considered in 
combination with 0.01 range queries. As in the previous experiment, the time needed 
by the XBR tree in this figure is significantly less. Furthermore, the time spent 
by the PMR tree again shows a sudden, large increase during queries 
4 to 8. Finally, in Figure 7 bottom there are 500 moving objects and 0.001 
range queries. In this case, again, the XBR tree consumes significantly less time 
than the PMR tree for all the range queries conducted. Note that this happens 
despite the higher number of disk accesses needed by the XBR tree (Figure 
5 bottom). In other words, although for 0.001 range queries the PMR quadtree 
slightly outperforms the XBR tree in disk accesses, overall (in execution time) 
the XBR tree significantly outperforms the PMR quadtree. 




Fig. 8. Node Accesses. 

As a final experiment, in Figure 8 we counted the number of disk accesses 
that were required by the XBR tree. The parameter in this experiment is the 
number of moving objects, which varies from 1000 to 5000, whereas the query 
range under consideration is 0.1 of the whole space. The reader may ask why we 
have not also presented results for the PMR quadtree. The answer is that, for 
these cardinalities of moving objects, the execution time needed to gather such results 
for the PMR quadtree was excessive. 



5 Conclusions and Future Work 

Given the strong application demand for monitoring mobile objects, it is very 
important to be able to efficiently locate these objects and answer queries about 
their positions over time. More specifically, modern applications, such as 
Mobile Computing and Geographic Information Systems, make the development 
of research within this field inevitable. 

In this paper, we proposed a method, XBR trees, that follows the Embedding 
Space Hierarchy and can efficiently keep track of the history of moving objects 
and answer spatiotemporal range queries. According to the experiments 
presented, this scheme is indeed efficient. In all experiments conducted, the 
XBR tree outperforms the PMR quadtree (the only Embedding Space Hierarchy 
method proposed so far for monitoring moving objects). Even 
in the case of very small query ranges, where the PMR quadtree slightly outperformed 
the XBR tree in disk accesses, overall (in execution time) the XBR tree 
significantly outperformed the PMR quadtree. 

The queries answered during experimentation were spatiotemporal range queries 
that tracked the history of the motion of the moving objects. One possible 
future extension of this investigation is the implementation of, and experimentation 
with, other types of spatiotemporal queries. Such types may include 

— nearest neighbor queries (e.g. ‘indicate the nearest neighbor of an object at 
each position of its trajectory’, a query of great importance), or 

— spatiotemporal joins, namely queries that deal with moving objects combined 
with moving regions (e.g. ‘Find all airplanes that intersect clouds while they 
move’). 

Future research may also include 

— experimenting with 3-dimensional versions of the quad-based trees (to com- 
bine time and the two coordinates X and Y in a single indexing structure), 

— employing alternative buffering methods to improve tree performance, or 

— comparing the XBR tree indexing method with other spatiotemporal trees 
(especially ones following the Object Space Hierarchy, like R-tree based 
structures), so that we can come up with more general conclusions about 
the winner structure for moving objects management. 



Abstract. In content-based music retrieval systems, both the correctness 
and the performance of retrievals are important, so a few such systems maintain 
a melody index that contains the representative melodies of music, that is, 
the melodies likely to be used in users’ queries. In this paper, we describe the 
development of a content-based music retrieval system in which a multidimensional 
index of time-sequenced representative melodies, extracted 
based on musical composition forms, is used to support quick and appropriate 
retrievals for users’ melody queries. The experimental results show 
that the developed system retrieves more relevant results than previous 
systems, with smaller storage overhead than a whole-melody index. 



1 Introduction 

Most music retrieval systems currently in use are based only on metadata of 
music, such as title, name of composer, name of singer, and lyrics ([1], [2]). 
In these traditional music retrieval systems, users must recall and specify metadata 
of the music they want as their queries. However, this type of querying is unnatural, 
because people tend to remember a part of the music itself rather than its 
metadata. Therefore, content-based music retrieval systems, in which users can query 
by melodies they remember, are required. 

A few content-based music retrieval systems have been developed ([3], 
[4], [5], [6]). In these systems, users specify query melodies by humming, by playing, 
or by drawing a part of the music they remember as the representative melodies. Here, by 
the representative melodies of music we mean melodies that stand for the 
music as a whole, such as the first melody, the climax melody, and the repeated theme 
melodies; people remember the music by these melodies, so 
users are likely to use them as query melodies ([7]). The system then retrieves 
the music information according to the similarity between the user’s query melody 
and the melodies in the music database. 

However, since most of the previous content-based music retrieval systems do not 
have an effective indexing mechanism, users may face long response times due 
to the time-consuming syntactic processing of retrievals, in which approximate 
matching between the query melody and all melodies of the underlying music database 
must be performed. Some of the previous systems ([6], [8]) have a first-melody 
index in which only the beginning part of each piece of music is included as the indexed 
melody. In general, however, since people are likely to remember music by its representative 
melodies and not only by its beginning, these systems with a first-melody 
index cannot support user queries on all possible representative melodies, 
including the climax melody and the repeated theme melodies of music. 

To remedy the above problem, some studies ([9], [10], [11], [12]) have concentrated 
on extracting repeated theme melodies from a music file. In [9] and [10], 
the authors consider only exactly repeated patterns, not approximately repeated patterns, 
as theme melodies in a music file; a theme melody index that contains only the 
exactly repeated theme melodies extracted by the mechanism of [9] or [10] is therefore not 
sufficient to support users’ queries on the representative melodies. In contrast, our 
previous work ([12]) proposed a theme melody extraction mechanism in which an 
extended graphical clustering algorithm groups the approximately repeated 
melodies into one cluster, taking musical composition forms into account, and a 
melody is extracted from each cluster as an approximately repeated theme melody. In 
addition to the approximately repeated theme melodies of music, the first melody and 
the climax melody of the music are added to the final representative melody 
set. Thus, the representative melody index can support users’ queries in which 
these kinds of melodies are used. 

In another previous work of ours ([13]), we developed a content-based music retrieval 
system in which a 2-dimensional representative melody index is used. In this system, 
since the dimension of the metric space for the representative melody index is 
just 2, several melodies that have totally different music patterns may be placed 
within a close distance. Hence, even though the signature of a melody, which stands for 
the melody’s variation patterns, is used to distinguish these melodies, the previous 
system sometimes returns slightly inappropriate results. These inappropriate retrievals 
mainly come from the excessive reduction of the semantic features of a melody into a point 
of a 2-dimensional metric space with two axes, the average length variation and the average 
pitch variation of the melody. 

In this paper, we discuss the development of a content-based music retrieval system 
in which a multidimensional index of time-sequenced representative melodies 
is used to enhance the correctness and the performance of retrievals from users’ incomplete 
melody queries. Since a melody arises from a continuous arrangement 
of notes over time, a melody can be transformed into time-sequenced 
data. In this work, a representative melody of motif length is transformed into 8-dimensional 
time-sequenced data, and it is mapped into a point of the 8-dimensional 
metric space of an M-tree ([14]) for the representative melody index. We also discuss the 
procedure for content-based retrieval using the multidimensional time-sequenced 
representative melody index and compare the performance of the proposed system to 
that of the previous one. 

The rest of this paper is organized as follows. Section 2 discusses the overall archi- 
tecture of the developed content-based music retrieval system. In section 3, we dis- 
cuss the systematic construction of the multidimensional time-sequenced representa- 
tive melody index. In section 4, we discuss the procedures and the appropriateness of 
the content-based music retrievals using the multidimensional time-sequenced representative 
melody index. We also discuss the performance of the system. Finally, 
section 5 concludes the paper and outlines future work. 

Fig. 1. Architecture of the Proposed Content-based Music Retrieval System 



2 Architecture of Content-Based Music Retrieval System 

The overall architecture of the content-based music retrieval system using the multidimensional 
time-sequenced representative melody index is shown in Fig. 1. It consists 
of the Registration interface, RM extractor & indexer, M-tree engine, Music database, 
Query & Result interface, MIDI generator, Query processor, and Ranker. 

From a new music file submitted via the Registration interface, the RM extractor & indexer 
extracts the representative melodies. This work is done in units of 
motifs, since the motif is the minimum meaningful unit in music semantics ([15]). This module 
has two primary sub-modules: Similarity computation and RM clustering. The Similarity 
computation module computes the similarity values of all pairs of motifs of the music. 
The RM clustering module classifies the motifs of the music file into one or more groups, 
each of which contains only motifs that are similar to each other. Details of 
extracting representative melodies from a music file are discussed in section 3.1. 

The representative melodies extracted from a music file are stored in 
the multidimensional time-sequenced representative melody index implemented by 
an M-tree ([14]), a well-known multidimensional indexing scheme. To place the extracted 
representative melodies into the multidimensional metric space of the M-tree, a 
representative melody of motif length is transformed into 8-dimensional time-sequenced 
data, and it is mapped into a point of the 8-dimensional metric space of the M-tree. 
Details of converting a melody of one motif into 8-dimensional time-sequenced 
data are discussed in section 3.2. 

The Query & Result interface consists of three querying modules and one module for relevance 
feedback. Since the drawing interface has been well tested so far and is well suited to 
viewing the user’s query melody, we use this interface hereafter. 

In this work, we assume that music files are in MIDI (musical instrument digital interface) 
format, since it is a well-known standard for computer music. The MIDI generator 
transforms a query melody given by humming, drawing, or playing into MIDI format as the 
intermediate format. Then, the Query processor builds the retrieval features of the query 
melody. It retrieves the relevant melodies from the multidimensional time-sequenced 
representative melody index of the M-tree by using a k-nearest neighbor search algorithm. 
The Ranker decides the ranks of the retrieval results according to the distances between the 
query melody and the retrieval results. Since a query melody may differ in length 
from the representative melodies in the representative melody index, we use a time-warping 
distance function ([16], [17]) to compute the distance between them. To enhance 
the appropriateness of retrieval results, users can go through a user relevance 
feedback phase via the relevance feedback module of the Query & Result interface. 
Details of retrieval and user relevance feedback are discussed in 
section 4. 



3 Systematic Construction of the Representative Melody Index 

3.1 Extraction of Representative Melodies 

The summarized procedure for the extraction of the representative melodies(RMs) 
from a music file is shown in Fig. 2. 



[Flowchart; only the first and last steps are recoverable: Input a music file ... Register the music file with the extracted RMs] 



Fig. 2. Procedure for Extracting Representative Melodies from Music File 

When a music file is submitted, features such as the time signature and the pitch and 
length of notes are extracted from it. Using this feature information, 
we decompose the music file into a set of motifs, since a motif is the semantic 
unit of music composition. We then compute the similarity values between all 
pairs of motifs by using the similarity computation algorithm ([18]) and 
construct the similarity matrix. The motifs of a music file are clustered based on 
the similarity values by using the proposed RM clustering algorithm, which considers the 
musical composition forms. The details of the clustering algorithm are given in [12]. 

From each cluster, we extract an approximately repeated theme melody as a representative 
melody based on the position or role of the motif within the music. If a 
cluster includes the first motif or the climax motif, we extract that motif as the representative 
melody of the corresponding cluster. Otherwise, we extract as the RM 
the melody that lies at the center of the cluster in the 
metric space of the M-tree. After an RM has been extracted from each cluster, if the first motif or 
the climax motif of the music file is not in the extracted melody set, we add 
it to the final set of representative melodies for the music file. 

Fig. 3. The Music Score of a Korean Children Song ‘Spring of Hometown’ 
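
The selection rule above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `select_representatives`, the `points` mapping from motif number to a point in the metric space, and the choice of "center" as the member with the smallest maximum distance to the other members are all assumptions. The text also does not say what happens when one cluster contains both the first and the climax motif; the sketch simply prefers the first motif.

```python
import math

def select_representatives(clusters, points, first_motif, climax_motif):
    """Pick one motif per cluster, then add the first/climax motif if missing.

    clusters: list of lists of motif numbers.
    points:   dict motif number -> coordinates (hypothetical representation).
    """
    chosen = []
    for cluster in clusters:
        special = [m for m in (first_motif, climax_motif) if m in cluster]
        if special:
            # Rule 1: a cluster holding the first or climax motif yields that motif.
            rep = special[0]
        else:
            # Rule 2 (assumed reading of "center"): minimise the largest
            # distance to any other member of the cluster.
            rep = min(cluster,
                      key=lambda m: max(math.dist(points[m], points[o])
                                        for o in cluster))
        chosen.append(rep)
    # Rule 3: the first and climax motifs always end up in the final set.
    for m in (first_motif, climax_motif):
        if m not in chosen:
            chosen.append(m)
    return chosen
```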

As an example of extracting the representative melodies, we use the Korean 
children’s song ‘Spring of Hometown’ of Fig. 3. The song consists of 8 motifs and has 
the musical composition form A-B-C-D-E-B'-C-D. From the musical 
composition form, we can expect the 8 motifs to contain three groups of similar 
motifs, {B, B'}, {C, C}, and {D, D}. 

When we input the MIDI file of ‘Spring of Hometown’ in Fig. 3 for registration, we 
see the screen of Fig. 4. The information on the representative melodies 
extracted from the music file is shown in the lower window. 

To see more details of RM clustering, users can click the ‘Representative’ 
button in the rightmost frame, which opens the screen of Fig. 5. This screen shows 
the similarity matrix, the clustering results with the threshold value produced by the RM extraction 
algorithm of [12], and the climax motif of the music file. From the similarity 
matrix, we can easily recognize that ‘Spring of Hometown’ has the musical composition 
form A-B-C-D-E-B'-C-D, since the similarity values between the 3rd and 7th motifs and between the 
4th and 8th motifs are 100, and the similarity value between the 2nd and 6th motifs is close to 
100 (exactly 99). That means we finally have three clusters, {2, 6}, {3, 7}, and {4, 8}. From 
these clusters, as shown in Fig. 5, we extract the motifs 6, 3, and 8 as the representative 
melodies, respectively. The 1st motif is added to the final set as the first melody. 
Note that the 3rd motif is also the climax melody of the music file. 
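
How a similarity matrix and a threshold lead to such clusters can be illustrated with a simple grouping sketch. It is not the RM clustering algorithm of [12], which also considers musical composition forms; here motifs are merely merged whenever their pairwise similarity reaches a threshold (connected components via union-find). Only the three similarity values 100, 100, and 99 come from the text; the threshold of 90, the placeholder value 40 for all other pairs, and the name `cluster_motifs` are hypothetical.

```python
def cluster_motifs(similarity, threshold):
    """Group motifs whose pairwise similarity is >= threshold.

    similarity: dict mapping (i, j) with i < j to a value in [0, 100].
    Returns the clusters with at least two motifs, as sorted lists.
    """
    motifs = sorted({m for pair in similarity for m in pair})
    parent = {m: m for m in motifs}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for (i, j), value in similarity.items():
        if value >= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for m in motifs:
        groups.setdefault(find(m), []).append(m)
    return sorted(g for g in groups.values() if len(g) > 1)


# Values for (3, 7), (4, 8) and (2, 6) are from the text; the rest are placeholders.
similarity = {(i, j): 40 for i in range(1, 9) for j in range(i + 1, 9)}
similarity.update({(3, 7): 100, (4, 8): 100, (2, 6): 99})
clusters = cluster_motifs(similarity, threshold=90)
```

With these inputs the three groups {2, 6}, {3, 7}, and {4, 8} are recovered, and motifs 1 and 5 remain unclustered.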



3.2 Multidimensional Index of Representative Melodies 

To place an extracted melody into the multidimensional metric space of the representative 
melody index of the M-tree, we regard the melody as time-series data that 
consists of a sequence of values or events over time, and we transform a representative 
melody into 8-dimensional time-sequenced data. That is, the melody of a representative 
motif is translated into time-sequenced data of the form $\langle t_1, t_2, \ldots, t_8 \rangle$. From the 
analysis of our experimental music database of 300 songs, we can see that the denominator 
of the time-meter is either 4 or 8. Therefore, we translate a representative melody 
into time-sequenced data of 8 sequences. 

Fig. 4. Registration Window for ‘Spring of Hometown’ 

Fig. 5. RM Clustering Result for ‘Spring of Hometown’ 

To translate a representative motif into time-sequenced data of 8 sequences, we 
revise the length of each musical note with respect to time. In a motif, the length of a musical note 
is revised according to Table 1, and the pitch value of a note is set to the 
pitch value of the MIDI format. For example, a quarter note of ‘C5’ is 
represented as a note whose length is 8 and whose pitch is 60. 

So, we first translate a representative melody of n/m time-meter into 
16n sequences and then reorganize them into 8 sequences by taking the average value of 
each 2n sequences of the original 16n sequences. Fig. 6 shows an example of the translation 
of the 1st motif of the Korean children’s song ‘Spring of Hometown’ in Fig. 3 into 
8-dimensional time-sequenced data. Fig. 6(a) shows the graph of the revised pitch 
values of the consecutive notes in the motif over time, and Fig. 6(b) is the 8-dimensional 
time-sequenced data derived from it. 

Table 1. Revised Length of a Musical Note 

Revised length value   32      16     8        4       2          1
Note                   whole   half   quarter  eighth  sixteenth  thirty-second
Rest                   whole   half   quarter  eighth  sixteenth  thirty-second

[Plots omitted. x-axis: time; panel (b) has ticks 1 to 8.] 

(a) Revised pitch values with time (b) Translated time-sequenced data 

Fig. 6. Time-Sequenced Representation of the 1st Motif of ‘Spring of Hometown’ 
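
The translation into 8 sequences can be sketched as below. This is a minimal illustration under assumptions: a motif is given as (MIDI pitch, note name) pairs, the note names attached to the revised lengths are inferred from the quarter note having length 8, rests are not handled, and the names `REVISED_LENGTH`, `expand_notes`, and `to_eight_dimensions` are hypothetical.

```python
# Revised lengths as in Table 1 (note names are the usual English ones).
REVISED_LENGTH = {"whole": 32, "half": 16, "quarter": 8,
                  "eighth": 4, "sixteenth": 2, "thirty-second": 1}

def expand_notes(notes):
    """notes: list of (midi_pitch, note_name) -> one pitch sample per time unit."""
    samples = []
    for pitch, name in notes:
        samples.extend([pitch] * REVISED_LENGTH[name])
    return samples

def to_eight_dimensions(samples):
    """Average equal-width chunks so that 16n samples become 8 values."""
    if not samples or len(samples) % 8 != 0:
        raise ValueError("expected a positive multiple of 8 samples")
    width = len(samples) // 8  # equals 2n when there are 16n samples
    return [sum(samples[i * width:(i + 1) * width]) / width for i in range(8)]
```

For a hypothetical motif with 64 samples (n = 4), each of the 8 output values is the mean of 8 consecutive samples, so a chunk that mixes two pitches yields a value between them.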

However, since a comparison of time-sequenced data with absolute pitch values depends 
on the key of the song, we use relative time-sequenced data. The relative 8-dimensional 
time-sequenced data $\langle s_1, s_2, \ldots, s_8 \rangle$ is obtained by normalizing 
the absolute time-sequenced data with Equation (1). To distinguish between a note 
and a rest, we add 1 to $s_i$. Therefore, we get the final result {3.50, 3.50, 1.00, 3.50, 
5.50, 5.50, 3.50, 3.50} translated from the 1st motif of ‘Spring of Hometown’, as 
shown in the lower window of Fig. 4. 

$s_i = t_i - \mathrm{MIN}\{t_1, \ldots, t_8\} + 1$ (1) 
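
Equation (1) amounts to a one-line normalization, sketched below. The function name `to_relative` is hypothetical, and the sketch assumes that all 8 values belong to notes (the treatment of rests is not spelled out in the text). The absolute values in the usage note are invented so that the output coincides with the vector reported above; they are not taken from the song.

```python
def to_relative(sequence):
    """Equation (1): s_i = t_i - min(t_1, ..., t_8) + 1."""
    lowest = min(sequence)
    return [t - lowest + 1 for t in sequence]
```

For instance, the hypothetical absolute sequence [62.5, 62.5, 60.0, 62.5, 64.5, 64.5, 62.5, 62.5] and the same sequence transposed up by two semitones both map to the same relative vector, which is why the representation does not depend on the key.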

The time-sequenced data transformed from a representative melody is mapped into a 
point of the metric space of the representative melody index implemented by the M-tree. 
Since we translate a melody into 8-dimensional time-sequenced data, the dimension 
of the M-tree becomes 8, and a representative melody is mapped into a point of the 8-dimensional 
metric index space together with a radius. Here, the radius is the largest distance, 
computed by the Euclidean distance function, between the selected representative 
melody and the other melodies in the cluster. 



4 Content-Based Music Retrieval from Melody Query 

4.1 Procedure of Content-Based Music Retrieval 

The procedure for content-based music retrieval using the multidimensional 
time-sequenced representative melody index (abbreviated as MDTSRM 
index in the figure) is summarized in Fig. 7. When a user’s melody query with the expected number of results n 
is submitted to the query interface, the system extracts the feature information from 
the query and translates the query melody into 8-dimensional time-sequenced data 
in the same way as for indexing. With these features of the query, we perform a k-nearest 
neighbor search on the M-tree of the multidimensional time-sequenced representative 
melody index. In this work, we use 2n as the value of k in the k-nearest neighbor search 
in order to enrich the candidates for more appropriate retrieval results. 

To filter out inappropriate melodies from the 2n candidates and to rank the retrieval results, 
we compute the similarities between the query and the 2n candidates with the time-warping 
distance function. As is well known, time warping can compute the distance 
between two melodies of different lengths. Hence, we can retrieve the appropriate 
melodies from the representative melody index even when users submit query 
melodies that are not of one motif length. According to the similarity values, we 
choose the top n melodies from the 2n candidates and decide the ranks of the final retrieval 
results. 
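
The two-stage procedure (2n nearest neighbors, then re-ranking by time-warping distance) can be sketched as follows. Several points are assumptions rather than facts from the text: a linear scan stands in for the M-tree k-nearest-neighbor search, each index entry is assumed to store a full pitch sequence next to its 8-dimensional point, the local cost of the warping is the absolute difference, and the names `dtw_distance` and `retrieve` are hypothetical.

```python
import math

def dtw_distance(a, b):
    """Classic dynamic time warping distance with |x - y| as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def retrieve(query_point, query_sequence, index, n):
    """index: list of (melody_id, point8, sequence). Returns the top-n ids."""
    k = 2 * n
    # Stage 1: linear scan standing in for the M-tree k-NN search (k = 2n).
    candidates = sorted(index, key=lambda e: math.dist(query_point, e[1]))[:k]
    # Stage 2: re-rank the 2n candidates by time-warping distance.
    ranked = sorted(candidates, key=lambda e: dtw_distance(query_sequence, e[2]))
    return [melody_id for melody_id, _, _ in ranked[:n]]
```

The re-ranking step can change the order produced by the first stage: a candidate that is only second-closest as an 8-dimensional point may still be the best match once sequences of different lengths are warped against each other.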

Users view the final retrieval results in rank order and check the relevance 
of each music file in the results by viewing its music score with a visual 
interface and/or listening to the music with an auditory interface. If users are not satisfied with 
the retrieval results, they repeat the relevance feedback from the previous 
query or the previous results until they get the music files they want. In the relevance 
feedback phase, we perform a range search within the multidimensional time-sequenced 
representative melody index. The retrieval range of the relevance feedback 
is adjusted according to the degree of the user’s satisfaction with the previous retrieval 
results. 



[Flowchart; only the first and last steps are recoverable: Input a query melody ... Final retrieval results] 



Fig. 7. Procedure of Content-based Music Retrievals Using the Representative Melody Index 






K.-I Ku et al. 



4.2 Content-Based Music Retrieval Using the Representative Melody Index 

For content-based music retrieval in the developed system, we use the query interface 
shown in Fig. 8. In this interface, the user can draw the music score of the query and 
select the expected number of results at ‘Result Num’. The ‘Play’ and ‘Stop’ buttons 
start and stop playback of the query melody, respectively. After drawing a query 
melody, users click the ‘Query’ button to get the retrieval results, shown in the 
lower window of Fig. 8. These retrieval results come 
from the multidimensional time-sequenced representative melody index. However, 
when we select a row in the results, we can see the full music score that is stored in 
the underlying music database. After viewing and listening to the retrieval results, users 
can proceed to the relevance feedback phase for a selected result in the lower window of 
Fig. 8 by clicking the ‘Feedback’ button. 




Fig. 8. Query and Result from 8-dimensional Time-Sequenced RM Index 

To validate the appropriateness of the retrieval results from the multidimensional 
time-sequenced representative melody index, we list the retrieval results for the 
query melody of Fig. 8 in Table 2. From Table 2, we can easily recognize that the first 
result melody has exactly the same music pattern as the prefix of the query melody given 
in Fig. 8, and that the other retrieval results also have music patterns similar to the query 
melody. 

Table 2. Summarization of Retrieval Results for Query Melody of Fig. 8 

[Table with columns Rank and Music Scores; the music score images are not recoverable.] 

Instead of using the 8-dimensional metric space of the M-tree, our previous system used a 2-dimensional 
one. That is, to place melodies into the metric space of the M-tree, we compute 
the average length variation and the average pitch variation of each melody and 
the radius of each cluster. Assume that a representative melody of n/m time-meter 
has k consecutive notes, $[(l_1, p_1), (l_2, p_2), \ldots, (l_k, p_k)]$, where $l_i$ and $p_i$ are the 
length and pitch of the i-th musical note in the melody, respectively. The average length 
variation $\bar{l}$ and the average pitch variation $\bar{p}$ are computed by Equations (2) and (3), 
respectively. In Equation (2), the first term denotes the average difference between the lengths of the k 
musical notes in the representative melody and the denominator m of the time-meter of the 
music, and the second term denotes the average value of the k-1 length differences between 
consecutive musical notes. Similarly, in Equation (3), the first term denotes 
the average value of the pitch differences between the first musical note and the following 
k-1 ones, and the second term is the average value of the k-1 pitch differences 
between consecutive musical notes. The distance d(v, u) between two representative 
melodies $u(\bar{l}_u, \bar{p}_u)$ and $v(\bar{l}_v, \bar{p}_v)$ is computed by the Euclidean distance in 2-dimensional 
space. The radius of a cluster is the maximum distance between 
the extracted representative melody of the cluster and the other melodies in the cluster in the 2-dimensional 
metric space of the M-tree. 

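$\bar{l} = \left( \frac{\sum_{i=1}^{k} |m - l_i|}{k} + \frac{\sum_{i=1}^{k-1} |l_{i+1} - l_i|}{k-1} \right) / 2$ (2) 

$\bar{p} = \left( \frac{\sum_{i=1}^{k-1} |p_1 - p_{i+1}|}{k-1} + \frac{\sum_{i=1}^{k-1} |p_{i+1} - p_i|}{k-1} \right) / 2$ (3) 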
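
A small sketch of the two features, written from the verbal description of Equations (2) and (3) above. It assumes that each "difference" is an absolute difference and that the melody has at least two notes; the function names are hypothetical, and the sample values in the usage note are invented.

```python
def average_length_variation(lengths, m):
    """Mean of: average |m - l_i| over k notes, average |l_{i+1} - l_i| over k-1 steps."""
    k = len(lengths)
    to_meter = sum(abs(m - l) for l in lengths) / k
    between = sum(abs(b - a) for a, b in zip(lengths, lengths[1:])) / (k - 1)
    return (to_meter + between) / 2

def average_pitch_variation(pitches):
    """Mean of: average |p_1 - p_i| for i > 1, average |p_{i+1} - p_i| over k-1 steps."""
    k = len(pitches)
    from_first = sum(abs(pitches[0] - p) for p in pitches[1:]) / (k - 1)
    between = sum(abs(b - a) for a, b in zip(pitches, pitches[1:])) / (k - 1)
    return (from_first + between) / 2
```

For the invented lengths [4, 12, 8, 8] with m = 8 the length variation is (2 + 4) / 2 = 3.0, and for the invented pitches [60, 63, 60, 63] the pitch variation is (2 + 3) / 2 = 2.5; the melody would then be placed at the 2-dimensional point (3.0, 2.5).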



If we use the previous system, in which a 2-dimensional metric space is used instead of 
the 8-dimensional metric space for the representative melody index, with the same music 
database, the retrieval result for the same query melody of Fig. 8 is as shown in Fig. 9. 
Since the average length variation and the average pitch variation of the query melody are 
2.74 and 1.28, respectively, the melodies that have values similar to these are retrieved 
and are ranked based on the distance between the query and these melodies, as 
shown in Fig. 9. 




Fig. 9. Query and Result from 2-dimensional Time-Sequenced RM Index 

Table 3. Summarization of Retrieval Results for Query Melody of Fig. 9 

[Table with columns Rank and Music Score; the music score images are not recoverable.] 




To compare the music scores in Table 2, retrieved from the 8-dimensional time-sequenced 
representative melody index, with those from the 2-dimensional one, we 
list the latter retrieval results in Table 3. Comparing 
Table 2 and Table 3, we can easily recognize that the retrieval results of Table 2 are 
more relevant to the query melody than those of Table 3. 



4.3 Performance Evaluation 

We conducted experiments with a music database of 300 Korean children’s songs. From the 
experimental database, we obtained 753 representative melodies, whereas the total number 
of motifs in the database is 2,151. The ratio of the number of representative melodies to 
that of all melodies is 0.35. This means that using the representative melody index 
instead of a whole-melody index may reduce the storage overhead of the index by up to 65%. 
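
The storage figure follows directly from the two counts reported above, under the assumption that index size is proportional to the number of indexed melodies:

```latex
\frac{753}{2151} \approx 0.35, \qquad 1 - 0.35 = 0.65 \quad (\text{about } 65\% \text{ fewer index entries})
```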

In addition, to compare the correctness of the retrieval results, we conducted experiments 
with another experimental music database of 100 Korean children’s songs. In this 
database, 10 songs are artificial remakes of the original 
‘Spring of Hometown’, while 90 songs are selected from the first experimental music 
database of 300 Korean children’s songs. We index the representative melodies extracted 
from this database separately into the 2-dimensional metric 
space of an M-tree and the 8-dimensional metric space of an M-tree. To compare the precision 
of the retrieval results in the two cases, we use the 1st, 3rd, 6th, and 8th melodies as query melodies. 
As shown in Fig. 10, the average precision with the 8-dimensional time-sequenced 
representative melody index is higher than that with the 2-dimensional one. Therefore, we 
conclude that the 8-dimensional time-sequenced representative melody index is 
more useful for content-based music retrieval systems. 




Fig. 10. Comparison of Precisions (horizontal axis: number of retrieval results, 1 to 10)
5 Conclusions

In this paper, we discuss the development of a content-based music retrieval system in
which an 8-dimensional time-sequenced representative melody index is used to improve
the appropriateness and performance of retrieval from users’ melody queries.
Since music composition can be regarded as an arrangement of continuous notes within
the time-meter, a melody can be transformed into 8-dimensional time-sequenced
data. We first introduce the overall architecture of the content-based music
retrieval system developed in this work. We then discuss the construction of the
8-dimensional representative melody index from the extracted melodies and the
content-based retrieval procedure with relevance feedback from users’ melody queries.
According to the experimental results, the system saves up to 65% of the index space
compared with a whole-melody index, while the precision of its retrieval results
is higher than that of the previous system, in which a 2-dimensional index is used
for the representative melody index.

As future work, we will develop a music classification scheme and a recommendation
system for users based on the similarity between the user’s preference and the
semantics of music. We will then use this system in developing an e-commerce
system for digitized music files.
Abstract. The stochastic query optimization problem for multiple joins is
addressed. The two-site model of Drenick and Smith [2] is extended to four
relations stored at four different sites. The model of the join of three relations
stored at two sites leads to a nonlinear programming problem that has an
analytical solution. The model with four sites leads to a special kind of
nonlinear optimization problem (P), which cannot be solved analytically.
It is proved that problem (P) has at least one solution, and two new methods
are presented for solving it: an ad hoc constructive method and a new
evolutionary technique. The results obtained by the two optimization
approaches are compared.

Keywords and Phrases: Adaptive representation, distributed database,
evolutionary optimization, genetic algorithms, query optimization.



1 Introduction 

The query optimization problem for a single query in a distributed database
system has been treated in great detail in the literature. Many algorithms have been
developed for minimizing the cost of performing a single, isolated query
in a distributed database system; some methods can be found in [9], [1], [12].
Most approaches look for a deterministic strategy that assigns the component joins
of a relational query to the processors of a network that can execute them
efficiently, and that determines an efficient strategy for transferring the data.

A distributed system can receive different types of queries and process
them at the same time. Query processing strategies may then be distributed over
the processors of a network as probability distributions. In this case, determining
the optimal query processing strategy is a stochastic optimization
problem. Query optimization thus requires a different approach if the system is
viewed as one that receives different types of queries at different times and
processes more than one query at the same time.

The multiple-query problem is not deterministic; the multiple-query-input 
stream constitutes a stochastic process. The strategy for executing the multiple- 
query is distributed over the sites of the network as a probability distribution. 



A. Benczúr, J. Demetrovics, G. Gottlob (Eds.): ADBIS 2004, LNCS 3255, pp. 259-274, 2004.
The “decision variables” of the stochastic query optimization problem are the
probabilities that component operators of the query are executed at particular 
sites of the network. 

The main objective of the state-transition model is to give globally optimal
query-processing strategies. Drenick and Smith [2] treat the single-join model,
the general model for the join of two relations, and a multiple join of three
relations stored at two different sites. The general model for the join of two
relations leads to a linear programming problem. The multiple-join model of three
relations stored at two different sites leads to a nonlinear optimization problem
that can be solved analytically. The stochastic model for the join of three
relations stored at three different sites is presented in [13] and [14]; it leads
to a nonlinear programming problem of a specific form. A stochastic query
optimization model using semijoins is presented in [8].

The aim of this paper is to extend the stochastic model to the join of four
relations. In Section 2 the case in which the relations are stored at four sites is
considered. The stochastic query optimization problem for four relations
leads to a constrained nonlinear optimization problem. In Section 3 a constructive
method for solving the nonlinear programming problem is given. We also
consider an evolutionary technique based on a dynamic representation ([3], [4]). This
technique, called Adaptive Representation Evolutionary Algorithm (AREA), is
described in Section 4. The results obtained by applying these two approaches
are presented in Section 5. Two sets of values for the constants are used in these
experiments, and the solutions found by the two approaches are nearly the same.
The CPU time required by the evolutionary algorithm is less than the CPU
time required by the constructive method. Owing to its very short execution time,
the evolutionary algorithm can be applied in real-time systems.



2 Stochastic Query Optimization Problem of Four 
Relations Join 



Consider four relations stored at different sites of the distributed database. The
join of these four relations will be defined in the context of the stochastic model of
[2]. Consider relations $A$, $B$, $C$, $D$ stored at sites 1, 2, 3 and 4, respectively.
Denote by $Q_4$ the single-query type consisting of the join of four relations:

$$Q_4 = A \bowtie B \bowtie C \bowtie D.$$

The initial state of the relations referenced by the query $Q_4$ in the four-site
network is the column vector defined as:

$$s_0 = \begin{pmatrix} A \\ B \\ C \\ D \end{pmatrix},$$

where the $i$-th component of the vector $s_0$ is the set of relations stored at site $i$,
$i \in \{1, 2, 3, 4\}$, at time $t = 0$.
The initial state $s_0$ is given with time-invariant probability $p_0 = p(s_0)$, i.e. $p_0$ is the
probability that relation $A$ is available at site 1, relation $B$ at site 2, relation $C$
at site 3, and relation $D$ at site 4; that is, the four relations are not locked for
updating and are not unavailable for query processing for any other reason. We assume
that the input to the system consists of a single stream of queries of type $Q_4$.

For the purpose of stochastic query optimization we enumerate all logically
valid joins in the order in which they may be executed. Let us suppose that $Q_4$
has three valid execution sequences:



$$Q_4S_1 = (((A \bowtie B) \bowtie C) \bowtie D),$$

$$Q_4S_2 = ((A \bowtie B) \bowtie (C \bowtie D)),$$

$$Q_4S_3 = (A \bowtie (B \bowtie (C \bowtie D))).$$

Sequence $Q_4S_1$ can be applied if $A \cap B \neq \emptyset$. So the join $B' = A \bowtie B$ is executed
before the join $C' = B' \bowtie C$. The last executed join will be $D' = C' \bowtie D$. The
sequence $Q_4S_2$ is suitable for parallel execution.

The system that undergoes transitions in order to execute the join of four
relations is described in this section as in [2]. The strategy for executing the
multiple join is distributed over the sites of the network. Conditional probabilities
are associated with the edges of the state-transition graph. Executing a multiple
join is equivalent to solving an optimization problem, referred to as the
stochastic query optimization model. With respect to this model we can state the
following theorem (see [5]).



Theorem 1. The stochastic query optimization model for the multiple join
query of type $Q_4$ defines a nonlinear optimization problem.

The stochastic query optimization problem for the query $Q_4S_1$ is given by:

$$
(P_4)\quad
\begin{cases}
\text{minimize } \Delta t \\
\text{subject to:} \\
\tau_i < \Delta t, \quad i = 1, 2, 3, 4, \\
p_{0,11} + p_{0,21} = 1, \\
p_{11,12} + p_{11,22} = 1, \\
p_{21,32} + p_{21,42} = 1, \\
p_{12,13} + p_{12,23} = 1, \\
p_{22,33} + p_{22,43} = 1, \\
p_{32,53} + p_{32,63} = 1, \\
p_{42,73} + p_{42,83} = 1, \\
p_{0,11}, p_{0,21}, p_{11,12}, p_{11,22}, p_{21,32}, p_{21,42}, p_{12,13}, p_{12,23}, \\
p_{22,33}, p_{22,43}, p_{32,53}, p_{32,63}, p_{42,73}, p_{42,83} \in [0, 1],
\end{cases}
$$
where the mean processing times $\tau_i$, $i = 1, 2, 3, 4$, are expressed as:

$$
\begin{aligned}
\tau_1 &= T_{11}(B')\,p_{0,11} + T_{12}(C')\,p_{0,11}p_{11,12} + T_{13}(D')\,p_{0,11}p_{11,12}p_{12,13}, \\
\tau_2 &= T_{21}(B')\,p_{0,21} + T_{32}(C')\,p_{0,21}p_{21,32} + T_{53}(D')\,p_{0,21}p_{21,32}p_{32,53}, \\
\tau_3 &= T_{22}(C')\,p_{0,11}p_{11,22} + T_{42}(C')\,p_{0,21}p_{21,42} + T_{33}(D')\,p_{0,11}p_{11,22}p_{22,33} \\
       &\quad + T_{73}(D')\,p_{0,21}p_{21,42}p_{42,73}, \\
\tau_4 &= T_{23}(D')\,p_{0,11}p_{11,12}p_{12,23} + T_{43}(D')\,p_{0,11}p_{11,22}p_{22,43} \\
       &\quad + T_{63}(D')\,p_{0,21}p_{21,32}p_{32,63} + T_{83}(D')\,p_{0,21}p_{21,42}p_{42,83}.
\end{aligned}
\tag{2.1}
$$



The obtained problem $(P_4)$ is a constrained nonlinear optimization problem.
In the next section we propose a constructive approach for solving it.

The number of relations and sites may differ from one distributed database to
another, and the resulting nonlinear optimization problem then has a different
number of variables and constraints. Therefore we have to generalize problem
$(P_4)$ to an arbitrary number of relations and sites.

Let us consider $p$ continuous functions

$$f_1, \ldots, f_p : [0, 1]^n \to \mathbb{R}_+,$$

where $p$ is the number of sites in the distributed database and $f_i$ $(i = 1, \ldots, p)$
represents the mean processing time at site $i$.

Our optimization problem $(P_4)$ may be generalized to the following optimization
problem $(P_p)$.



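$$
(P_p)\quad
\begin{cases}
\text{minimize } \Delta t \\
\text{subject to:} \\
f_1(x_1, x_2, \ldots, x_n) < \Delta t, \\
\quad\vdots \\
f_p(x_1, x_2, \ldots, x_n) < \Delta t, \\
x_1, x_2, \ldots, x_n \in [0, 1].
\end{cases}
$$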



A new framework is necessary for establishing conditions under which
problem $(P_p)$ has a solution.

Let $(X, d)$ be a compact metric space and let

$$f_1, \ldots, f_p : X \to \mathbb{R}_+$$

be continuous, strictly positive functions.
Consider the following generic optimization problem:

$$
(P)\quad
\begin{cases}
\text{minimize } y, \quad y \in \mathbb{R} \\
\text{subject to:} \\
x \in X \ (X \text{ is a compact metric space}), \\
y > 0, \\
f_1(x) < y, \\
\quad\vdots \\
f_p(x) < y.
\end{cases}
$$

With respect to problem (P) we can state the following theorem (see [5]).

Theorem 2. Problem (P) has at least one solution.



3 A Constructive Method for Solving General Stochastic 
Query Problem 

Now we are ready to give a constructive method for solving problem (P). This 
method generates a sequence converging to a solution of the problem (P). Theo- 
rem 3 ensures that the constructed sequence really converges towards a solution 
of the optimization problem (P). 

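Let $f : X \to \mathbb{R}$ be the function defined by

$$f(x) = \max\{f_1(x), \ldots, f_p(x)\},$$

and let $y_0$ be the global minimum value of the function $f$, i.e.

$$y_0 = \min_{x \in X} f(x).$$

Let $A_1 \subset A_2 \subset A_3 \subset \ldots \subset A_n \subset \ldots$ be a sequence of finite subsets of $X$
such that $\bigcup_{n=1}^{\infty} A_n$ is dense (see [11]) in $X$, i.e. $\overline{\bigcup_{n=1}^{\infty} A_n} = X$, which is
equivalent to the fact that for every $x \in X$ there exist $x_n \in \bigcup_{n \in \mathbb{N}} A_n$ such that $x_n \to x$.

We consider

$$
\begin{aligned}
A_1 &= \{u_1, u_2, \ldots, u_{q_1}\}, & u_i &\in X,\ i = 1, \ldots, q_1, \\
A_2 &= \{v_1, v_2, \ldots, v_{q_2}\}, & v_j &\in X,\ j = 1, \ldots, q_2, \\
    &\ \,\vdots \\
A_n &= \{w_1, w_2, \ldots, w_{q_n}\}, & w_k &\in X,\ k = 1, \ldots, q_n,
\end{aligned}
$$

where $q_i \in \mathbb{N}^*$, $i = 1, \ldots, n$, and $q_n \to \infty$.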
Let us consider the sequence $(y_n)_{n \ge 1}$ defined as follows:

$$
\begin{aligned}
y_1 &= \min\{\max\{f_1(u_1), f_2(u_1), \ldots, f_p(u_1)\}, \ldots, \max\{f_1(u_{q_1}), f_2(u_{q_1}), \ldots, f_p(u_{q_1})\}\}, \\
y_2 &= \min\{\max\{f_1(v_1), f_2(v_1), \ldots, f_p(v_1)\}, \ldots, \max\{f_1(v_{q_2}), f_2(v_{q_2}), \ldots, f_p(v_{q_2})\}\}, \\
    &\ \,\vdots \\
y_n &= \min\{\max\{f_1(w_1), f_2(w_1), \ldots, f_p(w_1)\}, \ldots, \max\{f_1(w_{q_n}), f_2(w_{q_n}), \ldots, f_p(w_{q_n})\}\}.
\end{aligned}
$$

It is easy to see that the sequence $(y_n)_{n \ge 1}$ is monotone decreasing and bounded.
Therefore the sequence is convergent.
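
The monotonicity and boundedness claim can be spelled out in one line. The following is only a sketch of the reasoning, using nothing beyond the nesting of the sets $A_n$ and the definitions of $f$ and $y_0$ given above:

```latex
y_n = \min_{a \in A_n} f(a), \qquad
A_n \subset A_{n+1} \;\Longrightarrow\;
y_{n+1} = \min_{a \in A_{n+1}} f(a) \;\le\; \min_{a \in A_n} f(a) = y_n,
\qquad
y_n \;\ge\; \min_{x \in X} f(x) = y_0 .
```

A minimum over a larger set cannot be larger, which gives monotonicity, and no value of $f$ on $X$ lies below $y_0$, which gives the lower bound.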

With respect to the convergent sequence $(y_n)_{n \ge 1}$ we can state the following
theorem (see [6]).

Theorem 3. The sequence $(y_n)_{n \ge 1}$ converges to a solution of the problem (P).

In the case of solving problem $(P_p)$ using Theorem 3 we have

$$X = [0, 1]^n.$$

In order to obtain an approximate solution of problem $(P_p)$, the Constructive
Algorithm takes a uniform grid $G$ of this hypercube.

We may choose the sets $(A_i)_{i \in \mathbb{N}^*}$ in the following way:

Ai = \ ( e {0,l,...,n}j , 

\^\n n n J J 



Ak 



Ujo 

\ \2^~^n ’ 2^~^n ’ 



where 



2^~^n ) 



\loyh, ■ ■ ■ , ^2'=-ln G {O; 1; • ■ • I 2^ 




^0 ^ ^ ^ ‘In: 

jo < jl < ■ ■ ■ < j2n, 

Iq <C ll < <i l2k-^n- 

Our grid is the one induced by $A_1, A_2, \ldots, A_k$. The sets $(A_i)_{i \in \mathbb{N}^*}$ constructed
in the above way satisfy the conditions of Theorem 2. For our purposes we may
consider $n = 10$.

For each point of the grid $G$ we compute the values $f_s$, $s = 1, \ldots, p$. By choosing
the maximum of $f_s$, $s = 1, \ldots, p$, we ensure that each inequality in problem $(P_p)$
holds. The solution of the problem is the minimum of all the selected maxima.

The previous considerations enable us to formulate an algorithm for solving
problem $(P_p)$. This technique will be called Constructive Algorithm (CA) and
may be outlined as below.
Constructive Algorithm

Input:
    n                                  // the number of divisions;
    Functions f_s, s = 1, ..., p       // express the problem constraints.
begin
    Initializations:
    h = 1/n                            // the length of one division;
    valx_j = 0, j = 1, ..., k          // initial values for x_j;
    for s = 1 to p do                  // initial values for functions f_s
        valf_s = f_s(valx_1, valx_2, ..., valx_k)
    end for
    valmax = max{valf_s, s = 1, ..., p}
    valmin = valmax
    for j = 1 to k do                  // in xmin_j we store the x_j values for which we
        xmin_j = valx_j                // have the minimum of f_s
    end for

    Constructing the grid:
    for i_1 = 1 to n do
        valx_1 = i_1 * h
        for i_2 = 1 to n do
            valx_2 = i_2 * h
            ...
            for i_k = 1 to n do
                valx_k = i_k * h
                for s = 1 to p do      // calculate the values for functions f_s for
                    valf_s = f_s(valx_1, valx_2, ..., valx_k)   // the current values of x_j
                end for
                valmax = max{valf_s, s = 1, ..., p}
                if (valmax < valmin) then
                    valmin = valmax
                    for j = 1 to k do      // store in xmin_j the new x_j values
                        xmin_j = valx_j    // for which we have the
                    end for                // minimum of f_s
                end if
            end for // i_k
            ...
        end for // i_2
    end for // i_1
end

Remark. valmin denotes the minimum value of $\Delta t$ from problem $(P_p)$, and xmin_j,
j = 1, ..., k, denote the values of x_j, j = 1, ..., k, for which the minimum is
reached.
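
The grid search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `constructive_algorithm` and the two toy functions are hypothetical, and, unlike the pseudocode, the grid indices here start at 0 so that points with a zero coordinate are also visited.

```python
from itertools import product


def constructive_algorithm(funcs, k, n):
    """Return (valmin, xmin): the smallest value of max_s f_s(x) on a uniform grid.

    funcs: list of callables f_s, each taking a tuple of k floats in [0, 1]
    k:     number of variables
    n:     number of divisions per axis (grid step h = 1/n)
    """
    h = 1.0 / n
    valmin, xmin = None, None
    # Assumption: indices run over 0..n (the pseudocode loops over 1..n).
    for idx in product(range(n + 1), repeat=k):
        x = tuple(i * h for i in idx)
        valmax = max(f(x) for f in funcs)      # largest f_s at this grid point
        if valmin is None or valmax < valmin:  # keep the smallest maximum
            valmin, xmin = valmax, x
    return valmin, xmin


# Toy instance with two hypothetical functions (not taken from the paper).
funcs = [lambda x: 1.0 + x[0] + x[1], lambda x: 3.0 - x[0] - x[1]]
valmin, xmin = constructive_algorithm(funcs, k=2, n=10)
```

The cost grows as (n + 1)^k function evaluations, which is why an exhaustive grid becomes expensive as the number of variables increases.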
The Constructive Algorithm should be repeated for a new value of n, chosen so that
the new divisions include the old ones; in this way we obtain a new subset
$A_i$ of the set $X$.

The solution obtained by the Constructive Algorithm can be refined using a
Refinement Algorithm (RA) (see [6]).

Algorithms CA and RA can be used to solve the general stochastic optimization
problem (P). The problem of the four-relation join is formulated as
problem $(P_4)$, which is a particular case of the general problem (P).

Numerical experiments for solving problem $(P_4)$ using the Constructive
Algorithm and the Refinement Algorithm are presented in Section 5.



4 Solving Problem $(P_p)$ Using an Evolutionary Algorithm

In this section an evolutionary technique for solving the stochastic optimization
problem $(P_p)$ is proposed.

Let us denote 



Qi ^2-t-l p f 5 ■ ■ ■ 5 P- 



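Using these notations, the general problem (P) (see Section 2) can be reformulated
as the following constrained optimization problem:

$$
(P')\quad
\begin{cases}
\text{minimize } y \\
\text{subject to:} \\
y > 0, \\
f_i(x) < y, \quad i = 1, \ldots, p, \\
g_i(x) - 1 = 0, \\
x = (x_1, \ldots, x_n).
\end{cases}
$$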

The evolutionary method for solving problem (P') involves the following steps:

Step 1. Take the pointwise maximum of the $p$ functions $f_1, \ldots, f_p$.

Let us denote

$$f^*(x) = \max_{1 \le i \le p} f_i(x), \quad x \in X.$$

Step 2. Minimize the function $f^*$ by using an evolutionary algorithm.

In this case the fitness of the solution $x$ may be defined as:

$$eval(x) = f^*(x).$$

Let us denote by $x^*$ the obtained minimum.



Solving Stochastic Optimization in Distributed Databases 267 



Step 3. The solution $y$ of the problem can be obtained by setting

$$y = x^* + \varepsilon,$$

where $\varepsilon > 0$ is a small number.

Remark. The constraints $g_i$ were handled by treating each $x_i$ with even index $i$ as
being $(1 - x_{i-1})$. The advantage of applying an evolutionary technique for solving
problem (P') is that the function $f^*$ need not be computed in closed form;
only the values of $f^*$ for the candidate solutions are needed.
An evolutionary algorithm called Adaptive Representation Evolutionary Algorithm
(AREA) is proposed for solving problem (P'). The AREA technique is
described in what follows.

4.1 AREA Technique 

The main idea of the AREA technique is to use a dynamic encoding that allows each
solution to be encoded over a different alphabet. This approach is similar to that
proposed in [7]. The solution representation is adaptive and may change during
the search process as an effect of the mutation operator.

Each AREA individual consists of a pair $(x, B)$, where $x$ is a string encoding
the object variables and $B$ specifies the alphabet used for encoding $x$. $B$ is an
integer such that $B \ge 2$, and $x$ is a string of symbols over the alphabet
$\{0, 1, \ldots, B - 1\}$. If $B = 2$, the standard binary encoding is obtained.

Each solution has its own encoding alphabet. The alphabet over which $x$ is
encoded may be changed during the search process.

An example of an AREA chromosome is the following:

$$C = (301453, 6).$$

Remark. The genes of $x$ may be separated by commas if required (for instance
when $B > 10$).

4.2 Search Operator 

Within AREA, mutation is the only search operator. Mutation can modify the
object variables as well as the last chromosome position (which fixes the
representation alphabet).

When the changed gene belongs to the object-variable substring (the $x$ part
of the chromosome), the mutated gene is a symbol randomly chosen from the
same alphabet.

Example

Let us assume that the chromosomes can be represented over the alphabets
$B_2, \ldots, B_{30}$, where

$$B_i = \{0, 1, \ldots, i - 1\}.$$

Consider the chromosome $C$ represented over the alphabet $B_8$:

$$C = (631751, 8).$$
Suppose a mutation occurs at position 3 of the $x$ part of the chromosome
and the mutated value of the gene is 4. Then the mutated chromosome is:

$$C_1 = (634751, 8).$$

If the position specifying the alphabet is changed, then the object variables
will be represented using symbols over the new alphabet, corresponding to the
mutated value of $B$.

Consider again the chromosome $C$ represented over the alphabet $B_8$:

$$C = (631751, 8).$$

Suppose a mutation occurs at the last position and the mutated value is 5.
Then the mutated chromosome is:

$$C_2 = (23204042, 5).$$

$C$ and $C_2$ encode the same value (209897 in decimal) over two different
alphabets ($B_8$ and $B_5$).

Remark. A mutation generating an offspring worse than its parent is called a
harmful mutation. A constant called MAX_HARMFUL_MUTATIONS is used
to determine when the part of the chromosome that represents the alphabet will be
changed (mutated).
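
The alphabet change can be illustrated with a small base-conversion sketch. It assumes that the $x$ part is read as a non-negative integer with the most significant digit first; the paper does not spell out the decoding, and the helper names `to_int`, `to_digits` and `change_alphabet` are hypothetical.

```python
def to_int(digits, base):
    """Interpret a list of digits (most significant first) over {0, ..., base-1}."""
    value = 0
    for d in digits:
        if not 0 <= d < base:
            raise ValueError(f"digit {d} is not in the alphabet of size {base}")
        value = value * base + d
    return value


def to_digits(value, base):
    """Write a non-negative integer as a list of digits over {0, ..., base-1}."""
    if value == 0:
        return [0]
    digits = []
    while value > 0:
        value, remainder = divmod(value, base)
        digits.append(remainder)
    return digits[::-1]


def change_alphabet(chromosome, new_base):
    """Re-encode the x part of an AREA chromosome (x, B) over a new alphabet."""
    digits, base = chromosome
    return to_digits(to_int(digits, base), new_base), new_base


c = ([6, 3, 1, 7, 5, 1], 8)   # the chromosome C = (631751, 8) from the example
c2 = change_alphabet(c, 5)    # the same value re-encoded over the alphabet of size 5
```

Keeping the digits in a list rather than a string also covers alphabets with more than ten symbols, where the genes have to be separated.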

4.3 AREA Procedure 

During the initialization stage each AREA individual (chromosome, solution)
is encoded over a randomly chosen alphabet. Each solution is then selected for
mutation. If the offspring obtained by mutation is better than its parent, then
the parent is removed from the population and the offspring enters the new
population. Otherwise, a new mutation of the parent is considered. If the
number of successive harmful mutations reaches a prescribed threshold (denoted by
MAX_HARMFUL_MUTATIONS), then the individual's representation (alphabet
part) is changed, and it enters the new population with this new representation.

The reason behind this mechanism is to change the individual representation
dynamically whenever needed. If a particular representation has no
potential for further exploring the search space, then the representation is changed.
In this way we hope that the search space will be explored more efficiently.

The AREA technique may be depicted as follows.



AREA technique

begin
    Set t = 0;
    Randomly initialize chromosome population P(t);
    Set to zero the number of harmful mutations for each individual in P(t);
    while (t < number of generations) do
        P(t+1) = ∅;
        for k = 1 to PopSize do
            Mutate the k-th chromosome from P(t). An offspring is obtained.
            Set to zero the number of harmful mutations for offspring;
            if the offspring is better than the parent then
                the offspring is added to P(t+1);
            else
                Increase the number of harmful mutations for current individual;
                if the number of harmful mutations for the current individual =
                        MAX_HARMFUL_MUTATIONS then
                    Change the individual representation;
                    Set to zero the number of harmful mutations for the current
                        individual;
                    Add individual to P(t+1);
                else
                    Add current individual (the parent) to P(t+1);
                end if
            end if
        end for;
        Set t = t + 1;
    end while;
end
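
The procedure can be sketched in Python for a toy one-variable problem. Several choices below are assumptions made for illustration and are not taken from the paper: a single real variable between 0 and 1, the digit string read as a base-B fraction, exactly one gene changed per mutation, a truncated re-encoding when the alphabet changes, and the names `decode`, `encode`, `mutate`, `area` and `N_DIGITS`.

```python
import random

MAX_HARMFUL_MUTATIONS = 5
N_DIGITS = 12                # genes per chromosome (assumption)
ALPHABETS = range(2, 31)     # alphabet sizes 2, ..., 30


def decode(digits, base):
    """Read the digit string as a base-`base` fraction between 0 and 1."""
    value = 0.0
    for d in reversed(digits):
        value = (value + d) / base
    return value


def encode(value, base):
    """Truncated base-`base` expansion (N_DIGITS digits) of a value between 0 and 1."""
    digits = []
    for _ in range(N_DIGITS):
        value *= base
        d = min(int(value), base - 1)
        digits.append(d)
        value -= d
    return digits


def mutate(digits, base, rng):
    """Replace one randomly chosen gene by a random symbol of the same alphabet."""
    child = list(digits)
    child[rng.randrange(len(child))] = rng.randrange(base)
    return child


def area(fitness, pop_size=20, generations=50, seed=0):
    """Minimize `fitness`; return the best individual [digits, base, harmful_count]."""
    rng = random.Random(seed)
    pop = []
    for _ in range(pop_size):
        base = rng.choice(ALPHABETS)
        pop.append([[rng.randrange(base) for _ in range(N_DIGITS)], base, 0])
    for _ in range(generations):
        new_pop = []
        for digits, base, harmful in pop:
            child = mutate(digits, base, rng)
            if fitness(decode(child, base)) < fitness(decode(digits, base)):
                new_pop.append([child, base, 0])      # offspring replaces the parent
                continue
            harmful += 1                              # harmful mutation: keep the parent
            if harmful == MAX_HARMFUL_MUTATIONS:      # change the representation
                new_base = rng.choice(ALPHABETS)
                digits = encode(decode(digits, base), new_base)
                base, harmful = new_base, 0
            new_pop.append([digits, base, harmful])
        pop = new_pop
    return min(pop, key=lambda ind: fitness(decode(ind[0], ind[1])))


best = area(lambda x: (x - 0.3) ** 2)   # toy fitness, hypothetical
x_best = decode(best[0], best[1])
```

For problem (P') the fitness would be $f^*$ evaluated on a vector of decoded variables; the single-variable version is used here only to keep the sketch short.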



5 Numerical Experiments 

We consider two numerical experiments for solving problem $(P_4)$ using the
Constructive Algorithm and the evolutionary technique described above. The results
obtained by applying the AREA technique are compared with those obtained by applying
the Constructive Algorithm followed by the Refinement Algorithm.

We present the results obtained for a communication network with a speed
of $6 \cdot 10^4$ bps. The model allows different transfer speeds for
distinct connections, but in our experiments the transfer speed is assumed to be
the same for every connection. Table 1 and Table 4 give, for the two cases,
the number of bits of every relation and the time needed to
transfer it through the network.

We have to approximate the size of $B'$, where $B' = A \bowtie B$, and the size of
$C'$, where $C' = B' \bowtie C$.
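
The transfer times listed in Table 1 follow from dividing the relation size by the link speed. A minimal sketch, assuming a speed of 6 · 10^4 bps (the value consistent with 8,000,000 bits taking 133.33 s in Table 1); the names `SPEED_BPS` and `transfer_time` are hypothetical:

```python
SPEED_BPS = 6 * 10**4  # assumed network speed in bits per second


def transfer_time(bits, speed_bps=SPEED_BPS):
    """Seconds needed to ship `bits` over a link of `speed_bps` bits per second."""
    return bits / speed_bps


# Relation sizes for Experiment 1, in bits (from Table 1).
sizes = {"A": 8_000_000, "B": 4_000_000, "C": 10_000_000, "D": 5_000_000}
times = {name: transfer_time(bits) for name, bits in sizes.items()}
```

The tables appear to truncate rather than round the results (for example 66.66 s for relation B).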



Table 1. Relation sizes and transfer times for Experiment 1.

Relation        Number of bits    Transfer time
Relation A           8,000,000         133.33 s
Relation B           4,000,000          66.66 s
Relation C          10,000,000         166.66 s
Relation D           5,000,000          83.33 s
Relation B'         10,400,000         173.33 s
Relation C'         25,600,000         426.66 s
Table 2. Parameters used by AREA technique.

AREA parameter              Value
Population size             100
Number of generations       100
Mutation probability        0.01
Number of variables         14
Number of alphabets         30
MAX_HARMFUL_MUTATIONS       5



The database management system can take these sizes from the database
statistics. In our computation we ignore the local processing time, because it is
negligible compared to the transmission time.

5.1 Experiment 1 

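Consider problem $(P_4)$, where the mean processing times at the four sites are:

$$
\begin{aligned}
\tau_1 &= 66.66\,p_{0,11} + 166.66\,p_{0,11}p_{11,12} + 83.33\,p_{0,11}p_{11,12}p_{12,13}, \\
\tau_2 &= 133.33\,p_{0,21} + 166.66\,p_{0,21}p_{21,32} + 83.33\,p_{0,21}p_{21,32}p_{32,53}, \\
\tau_3 &= 173.33\,p_{0,11}p_{11,22} + 173.33\,p_{0,21}p_{21,42} + 83.33\,p_{0,11}p_{11,22}p_{22,33} \\
       &\quad + 83.33\,p_{0,21}p_{21,42}p_{42,73}, \\
\tau_4 &= 426.66\,p_{0,11}p_{11,12}p_{12,23} + 426.66\,p_{0,11}p_{11,22}p_{22,43} \\
       &\quad + 426.66\,p_{0,21}p_{21,32}p_{32,63} + 426.66\,p_{0,21}p_{21,42}p_{42,83}.
\end{aligned}
\tag{5.2}
$$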

The parameters used within the AREA technique are given in Table 2.

The results obtained by applying the AREA technique and the Constructive
Algorithm (followed by the Refinement Algorithm) are outlined in Table 3.

Remark. The final results obtained by the AREA technique and by CA are very similar.
Only the CPU time differs: the AREA technique needs 0.05 s, whereas
CA needs 11 minutes.
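
The objective value reported for CA + RA can be checked by evaluating (5.2) directly. The sketch below is an illustration, not the authors' code: it expands seven free probabilities into all fourteen using the pairwise sum-to-one constraints and takes the maximum of the four mean processing times. The function names and the dictionary keys are hypothetical; the coefficients and probabilities are those of (5.2) and the CA + RA column of Table 3.

```python
def processing_times(p):
    """Mean processing times (5.2); p maps a key such as "0,11" to p_{0,11}."""
    t1 = (66.66 * p["0,11"]
          + 166.66 * p["0,11"] * p["11,12"]
          + 83.33 * p["0,11"] * p["11,12"] * p["12,13"])
    t2 = (133.33 * p["0,21"]
          + 166.66 * p["0,21"] * p["21,32"]
          + 83.33 * p["0,21"] * p["21,32"] * p["32,53"])
    t3 = (173.33 * p["0,11"] * p["11,22"]
          + 173.33 * p["0,21"] * p["21,42"]
          + 83.33 * p["0,11"] * p["11,22"] * p["22,33"]
          + 83.33 * p["0,21"] * p["21,42"] * p["42,73"])
    t4 = 426.66 * (p["0,11"] * p["11,12"] * p["12,23"]
                   + p["0,11"] * p["11,22"] * p["22,43"]
                   + p["0,21"] * p["21,32"] * p["32,63"]
                   + p["0,21"] * p["21,42"] * p["42,83"])
    return t1, t2, t3, t4


def from_free_variables(x):
    """Expand 7 free probabilities into all 14, using p + p' = 1 for each pair."""
    pairs = [("0,11", "0,21"), ("11,12", "11,22"), ("21,32", "21,42"),
             ("12,13", "12,23"), ("22,33", "22,43"), ("32,53", "32,63"),
             ("42,73", "42,83")]
    p = {}
    for (first, second), value in zip(pairs, x):
        p[first] = value
        p[second] = 1.0 - value
    return p


# CA + RA column of Table 3 (first probability of each pair).
ca_ra = from_free_variables([0.7, 0.4, 0.9, 0.55, 0.75, 0.95, 0.85])
taus = processing_times(ca_ra)
delta_t = max(taus)   # objective value: the largest mean processing time
```

With these inputs the maximum is attained at site 3 and comes out at about 106.37, in agreement with the CA + RA entry in the last row of Table 3.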

5.2 Experiment 2 

Consider the experimental conditions given in Table 4. The mean processing
times at the four sites are in this case:

$$
\begin{aligned}
\tau_1 &= 16.66\,p_{0,11} + 33.33\,p_{0,11}p_{11,12} + 16.66\,p_{0,11}p_{11,12}p_{12,13}, \\
\tau_2 &= 133.33\,p_{0,21} + 33.33\,p_{0,21}p_{21,32} + 16.66\,p_{0,21}p_{21,32}p_{32,53}, \\
\tau_3 &= 173.33\,p_{0,11}p_{11,22} + 173.33\,p_{0,21}p_{21,42} + 16.66\,p_{0,11}p_{11,22}p_{22,33} \\
       &\quad + 16.66\,p_{0,21}p_{21,42}p_{42,73}, \\
\tau_4 &= 213.33\,p_{0,11}p_{11,12}p_{12,23} + 213.33\,p_{0,11}p_{11,22}p_{22,43} \\
       &\quad + 213.33\,p_{0,21}p_{21,32}p_{32,63} + 213.33\,p_{0,21}p_{21,42}p_{42,83}.
\end{aligned}
\tag{5.3}
$$
Table 3. Solutions obtained by AREA technique and CA for the first set of constants
considered.

Transfer          Solutions obtained    Solutions obtained
probabilities     by AREA               by the CA + RA
$p_{0,11}$        0.701                 0.7
$p_{0,21}$        0.298                 0.3
$p_{11,12}$       0.404                 0.4
$p_{11,22}$       0.595                 0.6
$p_{21,32}$       0.901                 0.9
$p_{21,42}$       0.098                 0.1
$p_{12,13}$       0.513                 0.55
$p_{12,23}$       0.486                 0.45
$p_{22,33}$       0.755                 0.75
$p_{22,43}$       0.244                 0.25
$p_{32,53}$       0.970                 0.95
$p_{32,63}$       0.029                 0.05
$p_{42,73}$       0.978                 0.85
$p_{42,83}$       0.0213                0.15
$\Delta t$        106.251               106.372



Table 4. Relation sizes and transfer times for Experiment 2.

Relation        Number of bits    Time to transfer
Relation A           8,000,000         133.33 s
Relation B           1,000,000          16.66 s
Relation C           2,000,000          33.33 s
Relation D           1,000,000          16.66 s
Relation B'         10,400,000         173.33 s
Relation C'         12,800,000         213.33 s



Table 5. Parameters used by AREA algorithm.

AREA parameter              Value
Population size             100
Number of iterations        10000
Mutation probability        0.01
Number of variables         14
Number of alphabets         30
MAX_HARMFUL_MUTATIONS       5



Parameters used by AREA in this case are given in Table 5. 

The results obtained by applying AREA algorithm and CA + RA are given 
in Table 6. 

Remark. According to Table 6 the final solutions obtained by these two
algorithms are very close. The CPU time needed by the AREA technique is 0.05 s, much
less than the CPU time needed by CA, which is 15 minutes. Evolutionary algorithms
thus seem to be a useful technique for practical optimization purposes.

Table 6. Solutions obtained by AREA technique and CA for the second set of constants
considered.

Transfer          Solutions obtained    Solutions obtained
probabilities     by AREA               by the CA + RA
$p_{0,11}$        0.778                 0.75
$p_{0,21}$        0.221                 0.25
$p_{11,12}$       0.75                  0.75
$p_{11,22}$       0.249                 0.25
$p_{21,32}$       0.962                 0.875
$p_{21,42}$       0.037                 0.125
$p_{12,13}$       0.75                  1
$p_{12,23}$       0.25                  0
$p_{22,33}$       0.996                 0.875
$p_{22,43}$       0.003                 0.125
$p_{32,53}$       0.891                 0.25
$p_{32,63}$       0.108                 0.75
$p_{42,73}$       0.771                 0.875
$p_{42,83}$       0.228                 0.125
$\Delta t$        39.7492               40.3232



6 Stochastic Model Versus Heuristic Strategy 

We compare our stochastic optimization model with a very popular transfer
heuristic. According to the transfer heuristic (see [9]), the smaller of the two
operand relations of a join is transferred to the site of the other operand. A query
against a database is typically executed many times, not only once. The stochastic
model takes this into account and tries to share the executions of the same query
among the sites of the network.

We say that a strategy is “pure” if the execution path of the respective query
in the state-transition graph is the same every time the query is executed. That is,
if the query is executed several times, each join of the query is executed by the
same site every time.

In the following we compare the results of the proposed stochastic model with
the results given by a “pure” strategy.

In step 1 of Experiment 1 the transmission of relation $B$ is chosen more
often, because its size is smaller than the size of relation $A$. Therefore the system
undergoes the transition from state $s_0$ to state $s_{11}$ in 7 cases out of 10.

From state $s_{11}$, the transitions to states $s_{12}$ and $s_{22}$ are chosen in a balanced way.
This is because the size of relation $B'$ ($B' = A \bowtie B$) is approximately equal to the
size of relation $C$.






This balance is less evident from state $s_{21}$. This can be explained by the
heuristic character of the methods proposed in this paper.

In the third step of the query strategy, which is the join of relation $D$ with the
resulting relation $C'$ ($C' = B' \bowtie C$), relation $D$ is chosen for transfer in
nearly every case. This is because the size of $D$ is much smaller than the size of
relation $C'$.

Consider the “pure” strategy (deduced from the transfer heuristic): $s_0$, $s_{11}$, $s_{12}$,
$s_{13}$. Every join is executed at site 1, and the necessary time is 66.66 s + 166.66 s
+ 83.33 s = 316.65 s, which is much greater than the mean processing time given
by the stochastic query optimization model (i.e. 106.251 s).

In Experiment 2 the situation is similar. As the size of relation $B$ is smaller
than in Experiment 1, the transfer of $B$ is chosen more often than in
Experiment 1.

In step 3 the transfer of relation $D$ is chosen in most cases, since the size of $D$
is much smaller than the size of the result relation $C'$. In this case a “pure”
strategy with the same transfer heuristic may be $s_0$, $s_{11}$, $s_{12}$, $s_{13}$. The time
needed by this “pure” strategy at site 1 is 16.66 s + 33.33 s + 16.66 s = 66.65 s, which
is greater than the mean processing time given by the stochastic optimization
model (i.e. 39.7492 s).

7 Conclusions 

The stochastic query optimization problem for four relations stored at four different
sites leads to a constrained nonlinear optimization problem. For solving this
problem two different approaches are considered: a constructive (exhaustive) one
and an evolutionary one. The results obtained by applying these two methods
are very similar. The difference lies in CPU time: the execution time of the
evolutionary method is much shorter than the running time of the constructive
method (0.05 s versus 11 and 15 minutes in our two experiments). The evolutionary
approach thus seems to be more suitable for solving real-world applications in real time.

Acknowledgments. We are grateful to professors A. Benczúr and Cs. Varga
for their valuable suggestions.
Abstract. The one-two phase commit (1-2PC) protocol is a combination
of a one-phase atomic commit protocol, namely, implicit yes-vote,
and a two-phase atomic commit protocol, namely, presumed commit.
The 1-2PC protocol integrates these two protocols in a dynamic fashion,
depending on the behavior of transactions and system requirements, in
spite of their incompatibilities. This paper extends the applicability of
1-2PC to the multi-level transaction execution model, which is adopted
by database standards. Besides allowing incompatible atomic commit
protocols to co-exist in the same environment, 1-2PC has the advantage
of enhanced performance over the currently known atomic commit
protocols, making it more suitable for Internet database applications.



1 Introduction 

The two-phase commit (2PC) protocol [9,12] is one of the most widely used 
and optimized atomic commit protocols (ACPs). It ensures atomicity and inde- 
pendent recovery but at a substantial cost during normal transaction execution 
which adversely affects the performance of the system. This is due to the costs 
associated with its message complexity (i.e., the number of messages used for co- 
ordinating the actions of the different sites) and log complexity (i.e., the amount 
of information that needs to be stored in the stable storage of the participating 
sites for failure recovery). For this reason, there has been renewed interest
in developing more efficient ACPs and optimizations. This is especially
important given the current advances in electronic services and electronic
commerce environments, which are characterized by a high volume of transactions
where commit processing overhead is more pronounced. The most notable results
that aim at reducing the cost of commit processing are one-phase commit (1PC)
protocols such as implicit yes-vote (IYV) [4,6] and coordinator log (CL) [19].

Although 1PC protocols are, in general, more efficient than 2PC protocols,
1PC protocols place assumptions on transactions or the database management
systems (DBMSs). Whereas some of these assumptions are realistic (i.e., reflect
how DBMSs are usually implemented), others can be considered restrictive in
some applications [1,6]. For example, 1PC protocols restrict the implementation
of applications that wish to utilize deferred consistency constraints validation,
an option that is specified in the SQL standards.

The one-two phase commit (1-2PC) protocol attempts to achieve the best
of the two worlds, namely, the performance of 1PC and the wide applicability
of 2PC. It is essentially a combination of 1PC (in particular, IYV) and 2PC
(in particular, Presumed Commit, PrC [16]). It starts as 1PC and dynamically
switches to 2PC when necessary. Thus, 1-2PC achieves the performance
advantages of 1PC protocols whenever possible and, at the same time, the wide
applicability of 2PC protocols. In other words, 1-2PC supports deferred constraints
without penalizing those transactions that do not require them. Furthermore,
1-2PC achieves this advantage on a participant (cohort) basis within the same
transaction in spite of the incompatibilities between the 1PC and 2PC protocols.

This paper extends the applicability of 1-2PC to the multi-level transaction
execution (MLTE) model, the one adopted by database standards and
implemented in commercial systems. The MLTE model is especially important in the
context of Internet transactions since they are hierarchical in nature, which
makes multi-level 1-2PC more suitable for Internet database applications.

In Section 2, we review PrC, IYV and 1-2PC. Multi-level 1-2PC is introduced
in Section 3. The performance of 1-2PC is analytically evaluated in Section 4.

2 Background 

A distributed/Internet transaction accesses data by submitting operations to its 
coordinator. The coordinator of a transaction is assumed to be the transaction 
manager at the site where the transaction is initiated. Depending on the data 
distribution, the coordinator decomposes the transaction into a set of subtrans- 
actions, each of which executes at a single participating database site (cohort). 

In the multi-level transaction execution (MLTE) model, it is possible for a
cohort to decompose its assigned subtransactions further. Thus, a transaction
execution can be represented by a multi-level execution tree with its coordi- 
nator at the root, and with a number of intermediate and leaf cohorts. When 
the transaction finishes its execution and submits its final commit request, the 
coordinator initiates an atomic commit protocol. 

2.1 Presumed Commit Two-Phase Commit Protocol 

Presumed Commit (PrC) [16] is one of the best known variants of the two-phase
commit protocol, which consists of a voting phase and a decision phase. During the
voting phase, the coordinator requests all cohorts to prepare to commit whereas,
during the decision phase, the coordinator either commits the transaction if all
cohorts are prepared-to-commit (voted “yes”), or aborts the transaction if any
cohort has decided to abort (voted “no”).

In general, when a cohort receives the final decision and complies with the
decision, it sends an acknowledgment (ACK). ACKs enable a coordinator to
discard all information pertaining to a transaction from its protocol table (that
is kept in main memory) and forget the transaction. Once the coordinator
receives ACKs from all the cohorts, it knows that all cohorts have received the 
decision and none of them will inquire about the status of the transaction in 
the future. In PrC, cohorts ACK only abort decisions and not commit ones. A 
coordinator removes a transaction from its protocol table either when it makes 
a commit decision or when it receives ACKs from all cohorts in the case of abort 
decision. This means that in case of a status inquiry, a coordinator can interpret 
lack of information on a transaction to indicate a commit decision. 

In PrC, misinterpretation of missing information as a commit after a coordi- 
nator’s failure is avoided by requiring coordinators to record in a force written 
initiation log record all the cohorts for each transaction before sending prepare 
to commit messages to the cohorts. To commit a transaction, the coordinator 
force writes a commit record to logically eliminate the initiation record of the 
transaction and then sends out the commit decision. When a cohort receives 
the decision, it writes a non-forced commit record and commits the transaction 
without having to ACK the decision. After a coordinator or a cohort failure, if 
the cohort inquires about a committed transaction, the coordinator, not remem- 
bering the transaction, will direct the cohort to commit it (by presumption). 

To abort a transaction, the coordinator does not write an abort decision in 
its log. Instead, it sends out the abort decision and waits for ACKs. When a 
cohort receives the decision, it force writes an abort record and sends an ACK. 
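
The logging and presumption rules above can be sketched in a few lines of
Python. This is only an illustrative model under stated assumptions: the class
and method names (PrCCoordinator, begin_commit_processing, decide, ack_abort,
inquire) are hypothetical, the log is an in-memory list, failures are not
modeled, and the coordinator is assumed to wait for abort ACKs only from
cohorts that voted “yes”.

```python
class PrCCoordinator:
    """Toy two-level presumed-commit coordinator (illustrative names only)."""

    def __init__(self):
        self.protocol_table = {}  # txn -> {"state": str, "pending": set of cohorts}
        self.log = []             # (record type, txn, forced?) in write order

    def begin_commit_processing(self, txn, cohorts):
        # The initiation record is force written before any prepare message is sent.
        self.log.append(("initiation", txn, True))
        self.protocol_table[txn] = {"state": "preparing", "pending": set(cohorts)}

    def decide(self, txn, votes):
        """votes maps cohort -> True ("yes") or False ("no")."""
        if all(votes.values()):
            # A forced commit record logically eliminates the initiation record.
            # Commit decisions are not ACKed, so the transaction is forgotten now.
            self.log.append(("commit", txn, True))
            del self.protocol_table[txn]
            return "commit"
        # No abort record is written by the coordinator; it waits for abort ACKs.
        # Assumption: only cohorts that voted "yes" still need to ACK.
        entry = self.protocol_table[txn]
        entry["state"] = "aborted"
        entry["pending"] = {c for c, yes in votes.items() if yes}
        if not entry["pending"]:
            del self.protocol_table[txn]
        return "abort"

    def ack_abort(self, txn, cohort):
        entry = self.protocol_table.get(txn)
        if entry is None:
            return
        entry["pending"].discard(cohort)
        if not entry["pending"]:
            del self.protocol_table[txn]  # all ACKs received: forget the transaction

    def inquire(self, txn):
        entry = self.protocol_table.get(txn)
        if entry is None:
            return "commit"  # no information: commit by presumption
        return "abort" if entry["state"] == "aborted" else "undecided"
```

In this sketch an inquiry about a committed transaction is answered purely from
the absence of a protocol-table entry, which is the point of the presumption.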

In the MLTE model, the behavior of the root coordinator and each leaf cohort 
remains the same as in two-level transactions. The only difference is the behavior 
of cascaded coordinators (i.e., non-root and non-leaf cohorts) which behave as leaf 
cohorts with respect to their direct ancestors and root coordinators with respect 
to their direct descendants. In multi-level PrC, each cascaded coordinator has 
to force write an initiation record before propagating the prepare to commit 
message to its descendant cohorts. On an abort decision, it notifies its
descendants, force writes an abort record and, then, acknowledges its ancestor.
It forgets the transaction when it receives ACKs from all its descendants. On a
commit decision, a cascaded coordinator propagates the decision to its
descendants, writes a non-forced commit record and, then, forgets the transaction.

2.2 Implicit Yes-Vote One-Phase Commit Protocol

Unlike PrC, the implicit yes-vote (IYV) [4,6] protocol consists of only a single
phase, which is the decision phase. The (explicit) voting phase is eliminated by
overlapping it with the ACKs of the database operations. IYV assumes that each
site deploys (1) strict two-phase locking and (2) physical page-level replicated
write-ahead logging with the undo phase preceding the redo phase for recovery.

In IYV, when the coordinator of a transaction receives an ACK from a cohort
regarding the completion of an operation, the ACK is implicitly interpreted to
mean that the transaction is in a prepared-to-commit state at the cohort. When
the cohort receives a new operation for execution, the transaction becomes active
again at the cohort and can be aborted, for example, if it causes a deadlock or
a violation of any of the site’s database consistency constraints. If the transaction
is aborted, the cohort responds with a negative ACK message (NACK). Only
when all the operations of the transaction are executed and acknowledged by
their respective cohorts does the coordinator commit the transaction. Otherwise,
it aborts the transaction. In either case, the coordinator propagates its decision
to all the cohorts and waits for their ACKs.
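
The implicit vote can be illustrated with a minimal Python sketch of the
coordinator’s view of its cohorts. The names used here (IYVCoordinatorView,
send_operation, receive_reply, final_decision) are hypothetical and the model
ignores logging, locking and failures; it only tracks how each ACK or NACK
changes the state the coordinator assumes for a cohort.

```python
class IYVCoordinatorView:
    """Coordinator-side view of cohort states in IYV (illustrative sketch)."""

    def __init__(self):
        self.state = {}  # cohort -> "active" | "prepared" | "aborted"

    def send_operation(self, cohort):
        # A new operation makes the transaction active again at that cohort.
        self.state[cohort] = "active"

    def receive_reply(self, cohort, ok):
        # An ACK is read as an implicit "yes" vote; a NACK means the cohort aborted.
        self.state[cohort] = "prepared" if ok else "aborted"

    def final_decision(self):
        # Commit only if every operation was acknowledged, i.e. every cohort
        # is implicitly prepared when the commit request arrives.
        if all(s == "prepared" for s in self.state.values()):
            return "commit"
        return "abort"
```

A cohort that has received an operation but not yet replied is still “active”,
so a commit request arriving at that point would lead to an abort in this model.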

IYV handles cohort failures by partially replicating each cohort’s log at the
coordinator rather than force writing the log before each ACK. Each cohort
includes the redo log records that are generated during the execution of an
operation in the operation’s ACK. Each cohort also includes the read locks
acquired during the execution of an operation in the ACK in order to support
the option of forward recovery [6]. After a crash, a cohort reconstructs, with
the help of the coordinators, the state of its database, including its log and
lock table, as it was just prior to the failure. To limit the number of
coordinators that need to be contacted after a site failure, each cohort
maintains a recovery-coordinators’ list (RCL), which is kept in the stable
log. At the same time, by maintaining a local log and using WAL, each cohort
is able to undo the effects of aborted transactions locally using only its own log.

In multi-level IYV, the behavior of a root coordinator and leaf cohorts
remains the same as in IYV, whereas cascaded coordinators are responsible for
the coordination of ACKs of individual operations.

As in the case of the (two-level) IYV, only a root coordinator maintains a
replicated redo log for each of the cohorts. When a cascaded coordinator
receives ACKs from all its descendants that participated in the execution of an
operation, it sends an ACK to its direct ancestor containing the redo log records
generated across all cohorts and the read locks held at them during the
execution of the operation. Thus, after the successful execution of each operation,
the root coordinator knows all the cohorts (i.e., both leaf and cascaded
coordinators) in a transaction. Similarly, each cohort knows the identity of the
root coordinator, which is reflected in its RCL. The identity of the root
coordinator is attached to each operation sent by the root and cascaded coordinators.

While the execution phase of a transaction is multi-level, the decision phase
is not. Since the root coordinator knows all the cohorts at the time the
transaction finishes its execution, it sends its decision directly to each cohort
without going through cascaded coordinators. Similarly, each cohort sends its ACK
of the decision directly to the root coordinator. This is similar to the
flattening of the commit tree optimization [17].

2.3 The 1-2PC Protocol 

1-2PC is a composite protocol that inter-operates IYV and PrC in a practical
manner in spite of their incompatibilities. In 1-2PC, a transaction starts as 1PC
at each cohort and continues this way until the cohort executes a deferred
consistency constraint. When a cohort executes such a constraint, it means that
the constraint needs to be synchronized at commit time. For this reason, the
cohort switches to 2PC and sends an unsolicited deferred consistency constraint
(UDCC) message to the coordinator. The UDCC is a flag that is set as part
of a switch message, which also serves as an ACK for the operation’s successful
execution. When the coordinator receives the switch message, it switches the
protocol used with the cohort to 2PC.

When a transaction sends its final commit primitive, the coordinator knows
which cohorts are 1PC and which cohorts are 2PC. If all cohorts are 1PC (i.e.,
no cohort has executed deferred constraints), the coordinator behaves as an IYV
coordinator. On the other hand, if all cohorts are 2PC, the coordinator behaves
as a PrC coordinator with the exception that the initiation log record (of PrC)
is now called a switch log record.

When the cohorts are mixed 1PC and 2PC in a transaction’s execution, the
coordinator resolves the incompatibilities between the two protocols as follows:
(1) it “talks” IYV with 1PC cohorts and PrC with 2PC cohorts, and (2) it
initiates the voting phase with 2PC cohorts before making the final decision and
propagating it to all cohorts. This is because a “no” vote from a
2PC cohort is a veto that aborts a transaction. Further, in order to be able to
reply to the inquiry messages of the cohorts after failures, 1-2PC synchronizes
the timing at which it forgets the outcome of terminated transactions. A
coordinator forgets the outcome of a committed transaction when all 1PC cohorts
ACK the commit decision, and the outcome of an aborted transaction when all
2PC cohorts ACK the abort decision. In this way, when a cohort inquires about
the outcome of a forgotten transaction, the coordinator replies with a decision
that matches the presumption of the protocol used by the cohort, which is always
consistent with the actual outcome of the transaction.
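
The forgetting rule can be written down as a short Python sketch. The function
names (can_forget, reply_to_inquiry) and the data layout are hypothetical. The
reply given to a 1PC cohort (“abort”) is not stated explicitly in the text; it
is inferred here from the forgetting rule, since a 1PC cohort can only be left
inquiring about a forgotten transaction if that transaction aborted.

```python
def can_forget(outcome, protocol_of, acked):
    """Decide whether the coordinator may forget a terminated transaction.

    outcome     -- "commit" or "abort"
    protocol_of -- dict mapping cohort -> "1PC" or "2PC"
    acked       -- set of cohorts that have ACKed the decision
    """
    # Commits must be ACKed by all 1PC cohorts, aborts by all 2PC cohorts.
    awaited = "1PC" if outcome == "commit" else "2PC"
    return all(c in acked for c, p in protocol_of.items() if p == awaited)


def reply_to_inquiry(cohort_protocol):
    """Reply for a transaction the coordinator no longer remembers."""
    # 2PC cohorts follow PrC (presume commit); for 1PC cohorts the reply is
    # assumed to be abort, as inferred from the forgetting rule above.
    return "commit" if cohort_protocol == "2PC" else "abort"
```

For example, with cohorts a and c running 1PC and b running 2PC, a committed
transaction can be forgotten once a and c have ACKed, even if b never does.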

1-2PC has been optimized for read-only transactions and for context-free
transactions with a forward recovery option [3], but it has not previously been
extended to multi-level transactions. That extension is the subject of the next
section.

3 The Multi-level 1-2PC Protocol 

In extending 1-2PC to multi-level transactions, there are three cases to
consider: (1) all cohorts are 1PC, (2) all cohorts are 2PC and (3) cohorts are mixed
1PC and 2PC. We discuss each of these cases in the following three sections.

3.1 All Cohorts Are 1PC

In the multi-level 1-2PC, the behavior of the root coordinator and each leaf
cohort in the transaction execution tree remains the same as in two-level 1-2PC.
The only difference is the behavior of cascaded coordinators, which is similar to
that of the cascaded coordinators in the multi-level IYV. Since an operation’s
ACK represents the successful execution of the operation at the cascaded
coordinator and all its descendants that have participated in the operation’s
execution, the cascaded coordinator has to wait until it receives ACKs from the
required descendants before sending the (collective) ACK and redo log records to
its direct coordinator in the transaction execution tree. Thus, when a transaction
finishes its execution, all its redo records are replicated at the root
coordinator’s site. As in the two-level 1-2PC, only root coordinators are
responsible for maintaining the replicated redo log records and a root coordinator
knows all the cohorts (i.e., both leaf and cascaded coordinators).
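
As a rough illustration of the collective ACK, the following Python sketch
gathers, for one operation, the identities and redo records that a cascaded
coordinator would report upward. The function name collective_ack and the
dictionary-based tree are assumptions made for this example; read locks,
timing and failures are left out.

```python
def collective_ack(node, tree, redo):
    """Return (cohort ids, redo records) that `node` reports to its ancestor.

    tree -- dict mapping a cohort to the list of its direct descendants that
            participated in the operation
    redo -- dict mapping a cohort to the redo records it generated
    """
    cohorts = [node]
    records = list(redo.get(node, []))
    for child in tree.get(node, []):
        # The ACK is sent only after every participating descendant has ACKed,
        # so each child's collective ACK is merged before replying.
        child_cohorts, child_records = collective_ack(child, tree, redo)
        cohorts += child_cohorts
        records += child_records
    return cohorts, records
```

Applied at a direct descendant of the root, the result contains every cohort in
that branch, which is how the root learns the full membership of the tree.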

The identity of the root coordinator is attached to each operation sent by the
root and cascaded coordinators. When a cohort receives an operation from a root 
coordinator for the first time, it records the coordinator’s identity in its RCL 
and force writes its RCL into stable storage. A cohort removes the identity of a 
root coordinator from its RCL, when it commits or aborts the last transaction 
submitted by the root coordinator. 

As in IYV, if a cohort fails to process an operation, it aborts the transaction
and sends a NACK to its direct ancestor. If the cohort is a cascaded
coordinator, it also sends an abort message to each implicitly prepared cohort. Then,
the cohort forgets the transaction. When the root or a cascaded coordinator
receives a NACK from a direct descendant, it aborts the transaction, sends abort
messages to all direct descendants and forgets the transaction. The root
coordinator behaves similarly when it receives an abort request from a transaction.

On the other hand, if the root coordinator receives a commit request from
the transaction after the successful execution of all its operations, the
coordinator commits the transaction. On a commit decision, the coordinator force
writes a commit log record and then sends commit messages to each of its direct
descendants. If a descendant is a leaf cohort, it commits the transaction, writes
a non-forced log record and, when the log record is flushed into the stable log, it
acknowledges the commit decision.

If the cohort is a cascaded coordinator, the cohort commits the transaction,
forwards a commit message to each of its direct descendants and writes a
non-forced commit log record. When the cascaded coordinator has received ACKs from
all its direct descendants and the commit log record that it wrote has been
flushed into the stable log, the cohort acknowledges the commit decision to its
direct ancestor. Thus, the ACK serves as a collective ACK for the entire cascaded
coordinator’s branch.

3.2 All Cohorts Are 2PC 

At the end of the transaction execution phase, the coordinator declares the 
transaction as 2PC if all cohorts have switched to 2PC. When all cohorts are 
2PC, 1-2PC can be extended to the MLTE model in a manner similar to the 
multi-level PrC which we briefly discussed in Section 2.1 and detailed in [5]. 
However, multi-level 1-2PC is designed in such a way that 1-2PC does not realize 
the commit presumption of PrC on every two adjacent levels of the transaction 
execution tree. In this respect, it is similar to the rooted PrC which reduces the 
cost associated with the initiation records of PrC [5].

Specifically, cascaded coordinators do not force write switch records, which
are equivalent to the initiation records of PrC, and, consequently, do not presume
commitment in the case that they do not remember transactions. For this reason,
in multi-level 1-2PC, the root coordinator needs to know all the cohorts at all
levels in a transaction’s execution tree. Similarly, each cohort needs to know
all its ancestors in the transaction’s execution tree. The former allows the root
coordinator to determine when it can safely forget a transaction while the latter
allows a prepared-to-commit cohort at any level in a transaction’s execution
tree to find out the final correct outcome of the transaction, even if intermediate
cascaded coordinators have no recollection of the transaction due to a failure.

In order for the root coordinator to know the identities of all cohorts, each 
cohort includes its identity in the ACKs of the first operation that it executes. 
When a cascaded coordinator receives such an ACK from a cohort, it also in- 
cludes its identity in the ACK. In this way, the identities of all cohorts and the 
chain of their ancestors are propagated to the root coordinator. When the
transaction submits its commit request, assuming that all cohorts have requested
to switch to 2PC during the execution of the transaction, the coordinator force
writes a switch record, as in two-level 1-2PC. The switch log record includes 
the identities of all cohorts in the transaction execution tree. Then, it sends out 
prepare to commit messages to its direct descendants. 

When the coordinator sends the prepare to commit message, it includes its 
identity in the message. When a cascaded coordinator receives the prepare to 
commit message, it appends its own identity to the message before forwarding 
it to its direct descendants. When a leaf cohort receives a prepare to commit 
message, it copies the identities of its ancestors in the prepared log record before 
sending its “Yes” vote. When a cascaded coordinator receives “Yes” votes from 
all its direct descendants, the cascaded coordinator also records the identities of 
its ancestors as well as its descendants in its prepared log record before sending 
its collective “Yes” vote to its direct ancestor. 
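
The way ancestor identities accumulate in the prepare to commit message can be
sketched as follows. This is a simplified Python illustration: the function
name propagate_prepare and the dictionary-based tree are hypothetical, and the
sketch only computes which ancestor list each cohort would copy into its
prepared log record, without modeling votes or log writes.

```python
def propagate_prepare(node, tree, ancestors=()):
    """Return a dict mapping each descendant of `node` to the tuple of
    ancestor identities carried by the prepare message it receives.

    tree      -- dict mapping a coordinator to its direct descendants
    ancestors -- identities already in the message when it reached `node`
    """
    recorded = {}
    # Each sender appends its own identity before forwarding the message.
    path = ancestors + (node,)
    for child in tree.get(node, []):
        recorded[child] = path
        recorded.update(propagate_prepare(child, tree, path))
    return recorded
```

For a root with a cascaded coordinator c1 and a leaf l2 below c1, the leaf
would record (root, c1), which is the chain it later uses for inquiries.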

If any direct descendant has voted “No”, the cascaded coordinator force 
writes an abort log record, sends a “No” vote to its direct ancestor and an 
abort message to each direct descendant that has voted “Yes” and waits for 
their ACKs. Once all the abort ACKs arrive, the cascaded coordinator writes a 
non-forced end record and forgets the transaction. 

As in multi-level PrC, when the root coordinator receives “Yes” votes from 
all its direct descendants, it force writes a commit record, sends its decision to 
its direct descendants and forgets the transaction. When a cascaded coordinator 
receives a commit message, it commits the transaction, propagates the message 
to its direct descendants, writes a non-forced commit record and forgets the 
transaction. When a leaf cohort receives the message, it commits the transaction 
and writes a non-forced commit record. 

If the root coordinator receives a “No” vote, it sends an abort decision to 
all direct descendants that have voted “Yes” and waits for their ACKs, knowing 
that all the descendants of a direct descendant that has voted “No” have already 
aborted the transaction. When the coordinator receives all the ACKs, it writes a 
non-forced end record and forgets the transaction. When a cascaded coordinator 
receives the abort message, it behaves as in multi-level PrC. That is, it
propagates the message to its direct descendants and writes a forced abort
record. Then, it acknowledges its direct ancestor. Once the cascaded
coordinator has received ACKs from all its direct descendants, it writes a
non-forced end record and forgets the transaction. When a leaf cohort receives
the abort message, it first force writes an abort record and, then, acknowledges
its direct ancestor.

Fig. 1. Mixed cohorts in a 2PC cascaded coordinator’s branch (commit case).



3.3 Cohorts Are Mixed 1PC and 2PC

Based on the information received from the different cohorts during the
execution of a transaction, at commit time the coordinator of the transaction knows
the protocol of each of the cohorts. It also knows the execution tree of the
transaction. That is, it knows all the ancestors of each cohort and whether a cohort
is a cascaded coordinator or a leaf cohort. Based on this knowledge, the
coordinator considers a direct descendant to be 1PC if the descendant and all the
cohorts in its branch are 1PC, and 2PC if the direct descendant or any of the
cohorts in its branch is 2PC. For a 1PC branch, the coordinator uses the 1PC part
of multi-level 1-2PC with the branch, as we discussed above (Section 3.1). For a
2PC branch, the coordinator uses 2PC regardless of whether the direct
descendant is 1PC or 2PC. That is, the coordinator uses the 2PC part of multi-level
1-2PC discussed in the previous section (Section 3.2). Thus, except for the way
the coordinator decides which protocol to use with each of its direct
descendants, the coordinator’s protocol proceeds as in the two-level 1-2PC.
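
The branch classification rule is a simple recursion over the execution tree.
The Python sketch below illustrates it; the function name branch_protocol and
the dictionary representation of the tree are assumptions made for the example.

```python
def branch_protocol(node, tree, protocol_of):
    """Classify the branch rooted at a direct descendant `node`.

    tree        -- dict mapping a cohort to its direct descendants
    protocol_of -- dict mapping each cohort to "1PC" or "2PC"
    """
    # One 2PC cohort anywhere in the branch makes the whole branch 2PC.
    if protocol_of[node] == "2PC":
        return "2PC"
    for child in tree.get(node, []):
        if branch_protocol(child, tree, protocol_of) == "2PC":
            return "2PC"
    return "1PC"
```

With a 1PC cascaded coordinator c1 whose leaf l2 has switched to 2PC, the root
treats the whole c1 branch as 2PC and includes it in the voting phase.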

For leaf cohorts, each cohort behaves exactly in the same way as in two-level
1-2PC regardless of whether the leaf cohort descends from a 1PC or 2PC branch.
That is, a cohort behaves as a 1PC cohort if it has not requested to switch protocol
or as a 2PC cohort if it has made such a request during the execution of the
transaction.

Fig. 2. Mixed cohorts in a 2PC cascaded coordinator’s branch (abort case).



On the other hand, the behavior of cascaded coordinators is different and
depends on the types of its descendant cohorts in the branch. A cascaded
coordinator uses multi-level 1PC when all the cohorts in its branch, including itself,
are 1PC. Similarly, a cascaded coordinator uses multi-level 2PC when all the
cohorts in the branch, including itself, are 2PC. Thus, in the above two situations,
a cascaded coordinator uses multi-level 1-2PC as we discussed it in the previous
two sections, respectively.

When the protocol used by a cascaded coordinator is different from the protocol
used by at least one of its descendants (not necessarily a direct descendant),
there are two scenarios to consider. Since, for each scenario, cascaded
coordinators behave the same way at any level of the transaction execution tree, below
we discuss the case of the last cascaded coordinator in a branch.



2PC cascaded coordinator with 1PC cohort(s). When a 2PC cascaded
coordinator receives a prepare message from its ancestor after the transaction
has finished its execution, the cascaded coordinator forwards the message to
each 2PC cohort and waits for their votes. If any cohort has decided to abort,
the cascaded coordinator force writes an abort log record, then sends a “no”
vote to its direct ancestor and an abort message to each prepared cohort
(including 1PC cohorts). Then, it waits for the ACKs from the prepared 2PC
cohorts. Once it receives the required ACKs, it writes a non-forced end log
record and forgets the transaction. On the other hand, if all the 2PC cohorts
have voted “yes” and the cascaded coordinator’s own vote is a “yes” vote too,
the cascaded coordinator force writes a prepared log record and then sends
a (collective) “yes” vote of the branch to its direct ancestor, as shown in
Figure 1. Then, it waits for the final decision.

Fig. 3. Mixed cohorts in a 1PC cascaded coordinator’s branch (commit case).

Fig. 4. Mixed cohorts in a 1PC cascaded coordinator’s branch (abort case).

If the final decision is a commit (Figure 1), the cascaded coordinator forwards
the decision to each of its direct descendants (both 1PC and 2PC), and writes a
commit log record. The commit log record of the cascaded coordinator is written
in a non-forced manner, following PrC protocol. Unlike PrC, however, a cascaded
coordinator expects each 1PC cohort to acknowledge the commit message but
not 2PC cohorts since they follow PrC. When a cascaded coordinator receives
ACKs from 1PC cohorts, it writes a non-forced end log record. When the end
record is written into the stable log due to a subsequent forced write of a log
record or log buffer overflow, the cascaded coordinator sends a collective ACK
to its direct ancestor and forgets the transaction.

On the other hand, if the final decision is an abort (Figure 2), the cascaded 
coordinator sends an abort message to each of its descendants and writes a 
forced abort log record (following PrC protocol). When 2PC cohorts acknowl- 
edge the abort decision, the cascaded coordinator writes a non-forced end log 
record. Once the end record is written onto stable storage due to a subsequent 
flush of the log buffer, the cascaded coordinator sends an ACK to its direct 
ancestor and forgets the transaction. 

Notice that, unlike two-level 1-2PC, a 2PC cohort that is a cascaded
coordinator has to acknowledge both commit and abort decisions. A commit ACK
reflects the ACKs of all 1PC cohorts while an abort ACK reflects the ACKs of
all 2PC cohorts (including the cascaded coordinator’s ACK).



1PC cascaded coordinator with 2PC cohort(s). As mentioned above, a
1PC cascaded coordinator with 2PC cohorts is dealt with as 2PC with respect
to messages. Specifically, when a 1PC cascaded coordinator receives a prepare
message from its ancestor, it forwards the message to each 2PC cohort and waits
for their votes. If any cohort has decided to abort, the cascaded coordinator force
writes an abort log record, then sends a “no” vote to its direct ancestor and
an abort message to each prepared cohort (including 1PC cohorts). Then, it
waits for the abort ACKs from the prepared 2PC cohorts. Once the cascaded
coordinator receives the required ACKs, it writes a non-forced end log record
and forgets the transaction. On the other hand, if all the 2PC cohorts have voted
“yes”, the cascaded coordinator sends a (collective) “yes” vote of the branch to
its direct ancestor, as shown in Figure 3, and waits for the final decision.

If the final decision is a commit (Figure 3), the cascaded coordinator forwards
the decision to each of its direct descendants (both 1PC and 2PC), and writes a
commit log record. The commit log record of the cascaded coordinator is written
in a non-forced manner, following IYV protocol. Unlike IYV, however, a cascaded
coordinator expects each 1PC cohort to acknowledge the commit message but
not 2PC cohorts since they follow PrC. When a cascaded coordinator receives
ACKs from 1PC cohorts, it writes a non-forced end log record. When the end
record is written into the stable log due to a subsequent forced write of a log
record or log buffer overflow, the cascaded coordinator sends a collective ACK
to its direct ancestor and forgets the transaction.

On the other hand, if the final decision is an abort (Figure 4), the cascaded
coordinator sends an abort message to each of its descendants and writes a
non-forced abort log record (following IYV protocol). When 2PC cohorts
acknowledge the abort decision, the cascaded coordinator writes a non-forced
end log record. Once the end record is written onto stable storage due to a
subsequent flush of the log buffer, the cascaded coordinator sends an ACK to
its direct ancestor and forgets the transaction.

286 Y.J. Al-Houmaily and P.K. Chrysanthis

As in the case of a 2PC cascaded coordinator with mixed cohorts, a 1PC
cohort that is a cascaded coordinator has to acknowledge both commit and
abort decisions. A commit ACK reflects the ACKs of all 1PC cohorts (including
the cascaded coordinator’s ACK) while an abort ACK reflects the ACKs of all
2PC cohorts.
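
The two mixed-branch scenarios differ mainly in how the cascaded coordinator
writes its decision record, while the set of ACKs it waits for depends only on
the decision. The Python sketch below summarizes this after the final decision
has arrived. The function name cascaded_actions and the returned dictionary
layout are hypothetical, and the sketch covers only the final-decision step,
not the voting phase or the end record.

```python
def cascaded_actions(own_protocol, decision, descendants):
    """Summarize what a cascaded coordinator with mixed cohorts does on a
    final decision.

    own_protocol -- "1PC" or "2PC" (protocol of the cascaded coordinator)
    decision     -- "commit" or "abort"
    descendants  -- dict mapping each direct descendant to "1PC" or "2PC"
    """
    # Commit is ACKed by 1PC cohorts only; abort is ACKed by 2PC cohorts only.
    awaited = "1PC" if decision == "commit" else "2PC"
    wait_for = sorted(c for c, p in descendants.items() if p == awaited)
    # Only a 2PC cascaded coordinator forces its abort record (following PrC);
    # commit records are non-forced in both scenarios.
    forced = decision == "abort" and own_protocol == "2PC"
    return {
        "send_decision_to": sorted(descendants),
        "record": decision,
        "forced": forced,
        "await_acks_from": wait_for,
    }
```

In both scenarios the collective ACK to the direct ancestor follows only after
the awaited ACKs have arrived and the end record has reached the stable log.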

3.4 Recovering from Failures 

As in all other atomic commit protocols, site and communication failures are
detected by timeouts. If the root coordinator times out while awaiting the vote
of one of its direct descendants, it makes a final abort decision, sends abort
messages to all its direct descendants and waits for their ACKs to complete the
protocol. Similarly, if a cascaded coordinator times out while awaiting the vote
of one of its direct descendants, it makes an abort decision. It also force writes
an abort log record, sends a “no” vote to its direct ancestor and abort messages
to all its direct descendants and waits for their abort ACKs.

After a site failure, during its recovery process, a 2PC leaf cohort inquires its 
direct ancestor about the outcome of each prepared to commit transaction. In 
its inquiry message, the cohort includes the identities of its ancestors recorded in 
the prepared log record. In this way, if the direct ancestor of the prepared cohort 
does not remember the transaction, it uses the list of ancestors included in the in- 
quiry message to inquire its own direct ancestor about the transaction’s outcome 
rather than replying with a commit message by presumption. (Recall that a 2PC
cascaded coordinator does not write initiation records for transactions;
therefore, it cannot presume commitment in the absence of information about a
transaction.) Eventually, either one of the cascaded coordinators in the path of ancestors
will remember the transaction and provide a reply, or the inquiry message will 
finally reach the root coordinator. The root coordinator will respond with the 
appropriate decision if it remembers the outcome of the transaction or will re- 
spond with a commit decision by presumption. Once the cohort receives the reply 
message, it enforces the decision and sends an ACK only if the decision is abort. 
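
The escalation of an inquiry along the recorded chain of ancestors can be
sketched as below. This Python fragment is illustrative only: the function name
resolve_inquiry and the `memory` dictionary (which stands in for what each site
still remembers after failures) are assumptions, and message passing is
replaced by a simple loop.

```python
def resolve_inquiry(ancestors, memory):
    """Resolve the outcome of a prepared transaction for a recovering 2PC cohort.

    ancestors -- identities from the prepared log record, nearest ancestor
                 first and the root coordinator last
    memory    -- dict mapping a site to the outcome it still remembers
    """
    *cascaded, root = ancestors
    for site in cascaded:
        # A cascaded coordinator never presumes commit: it answers only if it
        # remembers the transaction, otherwise the inquiry moves up one level.
        if site in memory:
            return memory[site]
    # Only the root coordinator may answer "commit" by presumption.
    return memory.get(root, "commit")
```

The cohort then enforces the returned decision and, as stated above, sends an
ACK only if that decision is abort.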

On the other hand, if the leaf cohort is a 1PC cohort, the cohort uses its
RCL to resolve the status of those transactions that were active prior to
the failure, as in IYV. Specifically, the cohort inquires each of the coordinators
recorded in its RCL with a recovering message. Once the repair messages arrive
from the listed coordinators, the cohort repairs its log by applying the missing
redo records and finishes its recovery procedure. If the failure is a communication
failure and the cohort is left blocked in an implicit prepared state, the cohort
keeps inquiring its direct ancestor until it receives a final decision. Once the final
decision arrives, the cohort continues its protocol as during normal processing.

In the event that the root coordinator fails, during its recovery process, the 
root coordinator identifies and records in its protocol table each transaction with 
a switch log record without a corresponding commit or end record. These trans- 
actions have not finished their commit processing by the time of the failure and 
need to be aborted. For each of these transactions, the coordinator sends an
abort message to its direct descendants, as recorded in the switch record, along 
with their lists of descendants in the transaction execution tree. The recipient of 
the abort message can be either a cascaded coordinator or a leaf cohort. In the 
case of a cascaded coordinator, if it is in a prepared-to-commit state, the cas- 
caded coordinator behaves as in the case of normal processing discussed above. 
Otherwise, it responds with a blind ACK, indicating that it has already aborted 
the transaction. Similarly, if the abort message is received by a leaf cohort, the 
cohort behaves as in the case of normal processing if it is in a prepared-to-commit 
state or replies with a blind ACK. 

Similarly, for each transaction that has a commit log record but no
corresponding switch or end record, the coordinator knows that all
cohorts in this transaction's execution are 1PC and that the transaction had not
finished the protocol before the failure. For each of these transactions, the
coordinator adds the transaction to its protocol table and sends a commit message
to each of its direct descendants. Then, the coordinator waits for the ACKs of the
direct descendants. Once the required ACKs arrive, the coordinator writes an end log
record and forgets the transaction.
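
The two root-coordinator recovery cases described above amount to a classification of each transaction by the log records found for it. The following is a minimal sketch under that reading; the record names are taken from the text, while the function name and the returned labels are hypothetical. Combinations the text does not discuss are reported as such rather than guessed.

```python
def root_recovery_action(records: set) -> str:
    """Classify a transaction by the log records the root coordinator finds.

    `records` is a set of record kinds, e.g. {"switch"} or {"commit"}.
    """
    if "end" in records:
        return "nothing to do"  # the protocol had already finished
    if "switch" in records and "commit" not in records:
        # Commit processing did not finish: abort and notify descendants.
        return "send abort to direct descendants"
    if "commit" in records and "switch" not in records:
        # All cohorts are 1PC: re-send commit, await ACKs, then write end.
        return "send commit to direct descendants"
    return "not covered by the cases described here"
```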

In the case of a 2PC cascaded coordinator failure, during the recovery process, 
the cascaded coordinator adds to its protocol table each undecided transaction 
(i.e., a transaction that has a prepared record without a corresponding final 
decision record) and each decided (i.e., committed or aborted) transaction that 
has not been fully acknowledged by its direct descendants prior to the failure. For 
each undecided transaction, the cascaded coordinator inquires its direct ancestor 
about the outcome of the transaction. As in the case of a leaf cohort failure, 
the inquiry message contains the identities of all ancestors as recorded in the 
prepared record. Once the cascaded coordinator receives the final decision, it 
completes the protocol as in the normal processing case discussed above. For 
each decided but not fully acknowledged transaction, the cascaded coordinator 
re-sends decision messages to its direct descendants (according to the protocol 
specification) and waits for all their ACKs before completing the protocol as 
during normal processing, e.g., by writing a non-forced end log record. 



4 Analytical Evaluation 

In this section, we evaluate the performance of 1-2PC and compare it with the
performance of PrC and IYV. In our evaluation, we also include the presumed
abort (PrA) protocol [16] which is the other best known 2PC variant. As opposed 
to PrC, PrA coordinators assume that lack of information on a transaction 
indicates an aborted transaction. This eliminates the need for an abort log record 
at the coordinator and the need of an ACK and force write abort decision log 
records at the cohorts. 

Our evaluation method is based on evaluating the log, message and time
(message delay) complexities. In the evaluation, we consider the number of
coordination messages and forced log writes that are due to the protocols only
(e.g., we do not consider the number of messages that are due to the operations
and their acknowledgments). The costs of the protocols in both commit and
abort cases are evaluated during normal processing.

Table 1. The cost of the protocols to commit a transaction.

                                   PrC      PrA        IYV   1-2PC   1-2PC    1-2PC
                                                             (1PC)   (2PC)    (MIX)
Log force delays                   2d-1     d          1     1       d+1      3 ~ (d+1)
Total forced log writes            2c+l+2   2c+2l+1    1     1       c+l+2    n-p+2
Message delays (Commit)            2(d-1)   2(d-1)     0     0       2(d-1)   2 ~ (2d-1)
Message delays (Locks)             3(d-1)   3(d-1)     d-1   d-1     3(d-1)   (2+d) ~ 3(d-1)
Total messages                     3n       4n         2n    2n      3n       4c_MIX + 2p_1PC + 3p_2PC
Total messages with piggybacking   3n       3n         n     n       3n       3c_MIX + 2p_1PC + 3p_2PC

Table 2. The cost of the protocols to abort a transaction.

Tables 1 and 2 compare the costs of the different protocols for the commit
case and abort case on a per transaction basis, respectively. The column titled
“1-2PC (1PC)” denotes the 1-2PC protocol when all cohorts are 1PC, whereas
the column titled “1-2PC (2PC)” denotes the 1-2PC protocol when all cohorts
are 2PC. The column titled “1-2PC (MIX)” denotes the 1-2PC protocol in the
presence of a mixture of both 1PC and 2PC cohorts. In the tables, n denotes the
total number of sites participating in a transaction’s execution (excluding the
coordinator’s site), p denotes the number of 1PC cohorts (in the case of the 1-2PC
protocol), c denotes the number of cascaded coordinators, l denotes the number of
leaf cohorts and d denotes the depth of the transaction execution tree, assuming
that the root coordinator resides at level “1”.

The row labeled “Log force delays” contains the sequence of forced log 
writes that are required by the different protocols up to the point that the 
commit/abort decision is made. The row labeled “Message delays (Decision)” 
contains the number of sequential messages up to the commit/abort point, and 
the row labeled “Message delays (Locks)” contains the number of sequential 
messages that are involved in order to release all the locks held by a commit- 
ting/aborting transaction at the cohorts’ sites. In the row labeled “Total mes- 
sages with piggybacking”, we apply piggybacking of the ACKs, which is a special 
case of the lazy commit optimization to eliminate the final round of messages. 

It is clear from Tables 1 and 2 that 1-2PC performs as IYV when all cohorts
are 1PC cohorts, outperforming the 2PC variants in all performance measures,
including the number of log force delays to reach a decision as well as the total
number of forced log writes. For the commit case, the two protocols require only
one forced log write, whereas for the abort case neither 1-2PC nor IYV force
writes any log records. When all cohorts are 2PC, 1-2PC saves about d
sequential forced log writes and c total forced log writes in both the commit
and the abort case. This makes the performance enhancement of 1-2PC much more
significant in the presence of deep execution trees. The enhancement carries
over to 1-2PC with a mix of cohorts, where the costs associated with log force
delays, message delays to reach a decision and message delays to commit depend
on the number of sequential 2PC cohorts as well as their positions in the
execution tree.
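
To make the comparison concrete, the commit-case formulas can be evaluated for a given execution tree. The sketch below is only an illustration: it transcribes the first five columns of Table 1 as they could be recovered from the printed table (the MIX column, which gives ranges, is omitted), and the function and key names are hypothetical.

```python
def commit_costs(n: int, c: int, l: int, d: int) -> dict:
    """Commit-case costs per protocol, transcribed from Table 1.

    n: participating sites (excluding the coordinator's site),
    c: cascaded coordinators, l: leaf cohorts, d: depth of the tree.
    """
    def row(log_delays, forced_writes, msg_commit, msg_locks, msgs, msgs_piggy):
        return {
            "log_force_delays": log_delays,
            "forced_log_writes": forced_writes,
            "message_delays_commit": msg_commit,
            "message_delays_locks": msg_locks,
            "total_messages": msgs,
            "total_messages_piggybacking": msgs_piggy,
        }

    return {
        "PrC": row(2 * d - 1, 2 * c + l + 2, 2 * (d - 1), 3 * (d - 1), 3 * n, 3 * n),
        "PrA": row(d, 2 * c + 2 * l + 1, 2 * (d - 1), 3 * (d - 1), 4 * n, 3 * n),
        "IYV": row(1, 1, 0, d - 1, 2 * n, n),
        "1-2PC (1PC)": row(1, 1, 0, d - 1, 2 * n, n),
        "1-2PC (2PC)": row(d + 1, c + l + 2, 2 * (d - 1), 3 * (d - 1), 3 * n, 3 * n),
    }


costs = commit_costs(n=6, c=2, l=4, d=3)
saved_writes = costs["PrC"]["forced_log_writes"] - costs["1-2PC (2PC)"]["forced_log_writes"]
```

With these formulas, `saved_writes` equals c, which matches the statement that 1-2PC with only 2PC cohorts needs c fewer total forced log writes.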

Piggybacking can be used to eliminate the final round of messages for the
commit case in PrA, IYV and 1-2PC (1PC). That is not the case for PrC
and 1-2PC (2PC) because a commit decision is never acknowledged in these
protocols. Similarly, this optimization can be used in the abort case with PrC
and 1-2PC (2PC) but not with PrA, IYV or 1-2PC (1PC) since a cohort in
the latter set of protocols never acknowledges an abort decision. 1-2PC (MIX)
benefits from this optimization in both commit and abort cases. This is because,
in a commit case, a 1PC leaf cohort and each cascaded coordinator with mixed
cohorts acknowledge the commit decision, whereas in an abort case, a 2PC leaf
cohort and each cascaded coordinator with mixed cohorts acknowledge the abort
decision; both kinds of acknowledgment can be piggybacked.

Finally, the performance of multi-level 1-2PC can be further enhanced by 
applying three optimizations: read-only, forward recovery and flattening of the
commit tree. These were not discussed in this paper due to space limitations. 



5 Conclusions 



Recently, there has been a renewed interest in developing new atomic com-
mit protocols for different database environments. These environments include 
gigabit-networked, mobile and real-time database systems. The aim of these ef- 
forts is to develop new and optimized atomic commit protocols that meet the 
special characteristics and limitations of each of these environments. 

The 1-2PC protocol was proposed to achieve the performance of one-phase 
commit protocols when they are applicable, and the (general) applicability of 
two-phase commit protocols, otherwise. The 1-2PC protocol clearly alleviates 
the applicability shortcomings of 1PC protocols in the presence of (1) deferred
consistency constraints, or (2) limited network bandwidth. At the same time, 
it keeps the overall protocol overhead below that of 2PC and its well known 
variants, namely presumed commit and presumed abort. For this reason, we 
extended 1-2PC to the multi-level transaction execution model, the one specified 
by the database standards and adopted in commercial database systems. We 
also evaluated its performance and compared it to other well known commit 
protocols. Our extension to 1-2PC and the results of our evaluation demonstrate
the practicality and efficiency of 1-2PC, making it an especially important choice
for Internet transactions that are hierarchical in nature.
Abstract. Repairing a database means making the database consistent
by applying changes that are as small as possible. Nearly all approaches
to repairing have assumed deletions and insertions of entire tuples as
basic repair primitives. A negative effect of deletions is that when a
tuple is deleted because it contains an error, the correct values contained
in that tuple are also lost. It can be semantically more meaningful to
update erroneous values in place, an approach called update-based repairing.

We prove that a previously proposed approach to update-based repairing 
leads to intractability. Nevertheless, we also show that the complexity 
decreases under the rather plausible assumption that database errors 
are mutually independent. 

An inconsistent database can generally be repaired in many ways. The 
consistent answer to a query on a database is usually defined as the 
intersection of the answers to the query on all repaired versions of the 
database. We propose an alternative semantics, defining the consistent 
answer as being maximal homomorphic to the answers on all repairs. This 
new semantics always produces more informative answers and ensures 
closure of conjunctive queries under composition. 



1 Introduction 

Database textbooks generally explain that integrity constraints are used for cap- 
turing the set of all “legal” databases and hence should be satisfied at all times. 
Nevertheless, many operational databases contain data that is known or sus- 
pected to be inconsistent. Inconsistency may be caused, among other reasons, by 
data integration and underspecified constraints. For example, the rule “No em- 
ployee has more than one contact address” gives rise to an error if two databases 
to be integrated store different addresses for the same employee. The FIRSTNAME
CHAR(20) declaration in SQL does not protect us from inputting illegal first
names like “Louis14”. When later on we specify that first names cannot contain
numbers, the database may already turn out to be inconsistent.

Since database inconsistency is a widespread phenomenon, it is important 
to understand how to react to it. The seminal work of Arenas et al. [1] has
stimulated much research on the construct of repair for dealing with inconsistency.
In broad outline, a repair of an inconsistent database I is a database J that is
consistent and “as close as possible” to I. Closeness can be captured in many
different ways, giving rise to various definitions of repair. Under any definition,
a given database I can generally be repaired in more than one way. When there
are multiple repairs, the question arising is which repair to use for answering 
queries. The generally accepted query semantics is to execute the query on each 
repair and return the intersection of all query answers, the so-called consistent 
query answer. Intuitively, all repairs are equally possible and only tuples that 
appear in all answers are certainly true. 

Nearly all approaches so far have assumed that databases are repaired by 
deleting and inserting entire tuples. A problem with deletion/insertion-based re- 
pairing is that a single tuple may contain both correct and erroneous components. 
When we delete a tuple because it contains an error, we also lose the correct 
components as an undesirable side effect. For example, it might be undesirable 
to delete the entire tuple: 

(Firstname : Louis14, Name : De Funès, Nickname : Fufu, Born : 1914, Died : 1983)

from a movie star database at the time the first name is found to be illegal. To
overcome this problem, we proposed in [2] a notion of repairing that allows
updating the erroneous components in place, while keeping the consistent ones.
In the current example, the effect of such “update-based repairing” would be
to ignore Louis14 while keeping all other components. Although update-based
repairing is attractive from a semantics point of view, it remained unclear up to 
now whether it is tractable in general. We prove in this paper that the approach 
to update-based repairing introduced in [2] leads to intractability. Nevertheless, 
we also introduce an elegant variant with decreased complexity. 

Technically, our repair construct differs from other approaches in that it re- 
lies on homomorphisms instead of subset relationships between databases and 
repairs. Homomorphisms also naturally lead to a revised notion of consistent 
query answer. We define a consistent query answer as being maximal homo- 
morphic to the query answers on all repairs. This definition not only gives us 
more informative answers, but also ensures closure of conjunctive queries un- 
der composition. Closure is interesting because it allows for consistent views. 
Intersection-based consistent answers, on the other hand, do not give us closure. 

The paper is organized as follows. In Sect. 2, we give an overview of several 
repair constructs and show how they differ on a simple example. In Sect. 3, 
we show that under update-based repairing in its purest form, recognizing re- 
pairs is already intractable for very simple constraints. We gain tractability by 
adding a condition that captures a plausible hypothesis about the nature of er- 
rors. The so-called “independence-of-errors” thesis says that if the same constant 
has more than one erroneous occurrence in the same relation, then there is no 
reason to believe that all these errors should be corrected in the same way. Even- 
tually, this leads to a new repair construct, called uprepair. Section 4 introduces 
a homomorphism-based definition of consistent answer and shows that under 
this new semantics, we obtain closure of conjunctive queries under composition. 
Finally, Sect. 5 concludes the paper. 
2 Overview of Repair Constructs

A manufacturer of beauty products stores data about his production capacity. 
The single row in the relation I expresses that soap with mint flavor can be 
manufactured by the LA production plant at a fast rate. We introduce column 
headings for readability. 



I   Prod   Flavor   City   Rate
    soap   mint     LA     fast



For simplicity, we assume a unirelational database in the examples and the tech- 
nical treatment; nevertheless, all results extend to databases with multiple rela- 
tions. 

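All products that can be produced in LA can be produced in NY at a fast
rate ($\tau_1$), and mint products cannot be produced at a fast rate outside NY
($\varepsilon_1$). Furthermore, the production rate is fully determined by the
product, the flavor, and the city ($\varepsilon_2$).

$$\tau_1 : \forall w, x, z \,\bigl(P(w, x, \mathrm{LA}, z) \Rightarrow P(w, x, \mathrm{NY}, \mathrm{fast})\bigr)$$

$$\varepsilon_1 : \forall w, y \,\bigl(P(w, \mathrm{mint}, y, \mathrm{fast}) \Rightarrow y = \mathrm{NY}\bigr)$$

$$\varepsilon_2 : \forall w, x, y, z, z' \,\bigl(P(w, x, y, z) \wedge P(w, x, y, z') \Rightarrow z = z'\bigr)$$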

The symbols $\varepsilon$ and $\tau$ refer to equality-generating and
tuple-generating full dependencies [3], respectively. Note, however, that many
results apply to larger classes of constraints. The predicate symbol P is to be
interpreted by the relation I.

Clearly, the relation I falsifies both $\tau_1$ and $\varepsilon_1$. The relation I
can be repaired by simply deleting its single tuple. However, we can be more subtle
and assume that one of the values “mint,” “LA,” or “fast” is erroneous. These
diverging ways of repairing are formalized next.
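
The claim that I falsifies $\tau_1$ and $\varepsilon_1$ can be checked mechanically. The sketch below hard-codes the three constraints of the running example for ground relations only (tableaux with variables need the semantics of Definition 4); the function names are hypothetical.

```python
# The single-tuple relation I of the running example: (Prod, Flavor, City, Rate).
I = {("soap", "mint", "LA", "fast")}


def satisfies_tau1(rel):
    """tau_1: whatever is produced in LA is produced in NY at a fast rate."""
    return all((w, x, "NY", "fast") in rel for (w, x, y, z) in rel if y == "LA")


def satisfies_eps1(rel):
    """eps_1: mint products are produced at a fast rate only in NY."""
    return all(y == "NY" for (w, x, y, z) in rel if x == "mint" and z == "fast")


def satisfies_eps2(rel):
    """eps_2: product, flavor and city determine the rate."""
    return all(
        z1 == z2
        for (w1, x1, y1, z1) in rel
        for (w2, x2, y2, z2) in rel
        if (w1, x1, y1) == (w2, x2, y2)
    )


violated = [
    name
    for name, check in [("tau1", satisfies_tau1), ("eps1", satisfies_eps1), ("eps2", satisfies_eps2)]
    if not check(I)
]
```

Here `violated` lists the constraints that I falsifies; the empty relation satisfies all three, which is why deleting the single tuple is a (crude) repair.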

The notion of repair is defined relative to a fixed set $\Sigma$ of integrity
constraints. In general, a repair of a (possibly inconsistent) relation I is a
consistent relation J that is “as close as possible” to I. Different ways of
capturing closeness have resulted in different definitions of repair. Most authors
have relied on set inclusion to define closeness. In [4], a repair of I is a maximal
(w.r.t. $\subseteq$) consistent subset of I. This boils down to considering tuple
deletions as a repair primitive. Other approaches also take care of extensions of
the original relation. In [1], the symmetric difference between a relation and its
repairs is required to be minimal (w.r.t. $\subseteq$). This boils down to
considering insertions and deletions as repair primitives, and treating both
symmetrically. An asymmetric treatment of insertions and deletions is proposed by
Cali et al. [5]: under what they call the “loosely-sound” semantics, they minimize
(w.r.t. $\subseteq$) the set of tuples deleted during repairing, irrespective of the
tuples to be inserted. Importantly, in the running example, under each of these
semantics, the only repair of I relative to
$\{\tau_1, \varepsilon_1, \varepsilon_2\}$ would be the empty relation.

In [2], we expressed the closeness criterion between a database and its repairs
in terms of homomorphisms instead of subset relationships. In this approach,
F1  Prod  Flavor  City  Rate     F2  Prod  Flavor  City  Rate     F3  Prod  Flavor  City  Rate
    soap  x       LA    fast         soap  mint    y     fast         soap  mint    LA    z

U1  Prod  Flavor  City  Rate     U2  Prod  Flavor  City  Rate     U3  Prod  Flavor  City  Rate
    soap  x       LA    fast         soap  mint    NY    fast         soap  mint    LA    z
    soap  x       NY    fast                                          soap  mint    NY    fast

Fig. 1. The tableaux $F_1$, $F_2$, and $F_3$ are fixes of I and
$\{\tau_1, \varepsilon_1, \varepsilon_2\}$. The tableaux $U_1$, $U_2$, and $U_3$ are
minimal (w.r.t. $\succeq$) extensions of these fixes that satisfy each constraint
of $\{\tau_1, \varepsilon_1, \varepsilon_2\}$.



repairs can contain (existentially quantified) variables, i.e. repairs are tableaux. 
Definition 1 recalls the definition of tableau and introduces the homomorphism 
relationship underlying our work. 

Definition 1. We assume a fixed arity n. A tuple is a sequence $(p_1, \ldots, p_n)$
where each $p_i$ is either a variable or a constant. A tableau is a finite set of
tuples. A tuple or tableau without variables is called ground; a ground tableau
is also called a relation. If T is a tableau, then ground(T) is the set of ground
tuples in T.

A homomorphism from tableau S to tableau T is a substitution $\theta$ for the
variables in S that preserves tuples, i.e. $(p_1, \ldots, p_n) \in S$ implies
$(\theta(p_1), \ldots, \theta(p_n)) \in T$, where it is understood that $\theta$ is
the identity on constants. If such a homomorphism exists, then S is said to be
homomorphic to T, denoted $T \succeq S$.

S and T are equivalent, denoted $S \sim T$, iff $S \succeq T$ and $T \succeq S$.

In Fig. 1, each of $F_1$, $F_2$, $F_3$ is homomorphic to I. Furthermore, $F_1$,
$F_2$, $F_3$ are homomorphic to $U_1$, $U_2$, $U_3$, respectively. It can be easily
verified that the relation $\succeq$ is reflexive and transitive, and that $\sim$ is
an equivalence relation. In the database community, homomorphisms are well-known
tools used in the context of query containment [3]. We now introduce a stronger
homomorphism, which not only preserves tuples but also cardinality.

Definition 2. Tableau S is said to be one-one homomorphic to tableau T, denoted
$T \sqsupseteq S$, iff there exists a homomorphism $\theta$ from S to T that
identifies no two tuples of S, i.e. if $s_1$, $s_2$ are distinct tuples of S, then
$\theta(s_1) \neq \theta(s_2)$.

To see the difference between $\succeq$ and $\sqsupseteq$, consider the following
tableau E:

E   Prod  Flavor  City  Rate
    soap  x       y     fast
    soap  mint    y     z

The substitution $\theta = \{x/\mathrm{mint},\, y/\mathrm{LA},\, z/\mathrm{fast}\}$
is a homomorphism from E to I, hence $I \succeq E$. However,
$I \not\sqsupseteq E$ because any homomorphism from E to I must
map the two tuples of E onto the single tuple of I.
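
Definitions 1 and 2 can be tested by a small backtracking search. The sketch below is only illustrative: it assumes that variables are written as strings starting with "?" and constants as any other strings (a convention introduced here, not in the paper), and the function names are hypothetical.

```python
def is_var(term):
    return term.startswith("?")


def extend(theta, src, dst):
    """Extend substitution theta so that theta(src) == dst, or return None."""
    theta = dict(theta)
    for p, q in zip(src, dst):
        if is_var(p):
            if theta.setdefault(p, q) != q:
                return None  # variable already bound to a different term
        elif p != q:
            return None  # constants are mapped to themselves
    return theta


def homomorphic(S, T, one_one=False):
    """True iff S is (one-one) homomorphic to T."""
    S, T = list(S), list(T)

    def search(i, theta, used):
        if i == len(S):
            return True
        for t in T:
            if one_one and t in used:
                continue  # distinct tuples of S need distinct images
            ext = extend(theta, S[i], t)
            if ext is not None and search(i + 1, ext, used | {t}):
                return True
        return False

    return search(0, {}, frozenset())


I = {("soap", "mint", "LA", "fast")}
E = {("soap", "?x", "?y", "fast"), ("soap", "mint", "?y", "?z")}
F1 = {("soap", "?x", "LA", "fast")}
U1 = {("soap", "?x", "LA", "fast"), ("soap", "?x", "NY", "fast")}
```

With these definitions, `homomorphic(E, I)` holds while `homomorphic(E, I, one_one=True)` does not, which is the distinction made in the example above. The search is exponential in the worst case.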

At the center of each theory on repairing is the notion of consistency: 

Definition 3. A constraint is a formula $\sigma$ equipped with a semantics that
allows us to determine whether a tableau T satisfies $\sigma$, denoted
$T \models \sigma$. For a set $\Sigma$ of constraints, we write $T \models \Sigma$
iff $T \models \sigma$ for each $\sigma \in \Sigma$.

A tableau T is said to be consistent (w.r.t. $\Sigma$) iff $T \models \Sigma$;
otherwise T is inconsistent. A tableau T is said to be subconsistent (w.r.t.
$\Sigma$) iff it is homomorphic to a consistent tableau, i.e. $U \succeq T$ for
some consistent tableau U.

Significantly, we require that satisfaction of constraints be defined for tableaux.
Often, semantics for $\models$ defined on relations naturally carries over to
tableaux. This is the case for full dependencies:

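Definition 4. Let $\tau$ be a full tuple-generating dependency, and $\varepsilon$ a
full equality-generating dependency, i.e.

$$\tau : \forall^{*} \bigl(P(\bar{x}_1) \wedge \ldots \wedge P(\bar{x}_m) \Rightarrow P(\bar{x}_{m+1})\bigr),$$

$$\varepsilon : \forall^{*} \bigl(P(\bar{x}_1) \wedge \ldots \wedge P(\bar{x}_m) \Rightarrow p = q\bigr),$$

where every variable occurring at the right-hand of $\Rightarrow$ also occurs at
the left-hand of $\Rightarrow$. Let T be a tableau. Then, $T \models \tau$ iff for
every substitution $\theta$, if $\theta(\bar{x}_i) \in T$ for each i,
$1 \le i \le m$, then $T \sim T \cup \{\theta(\bar{x}_{m+1})\}$. Next,
$T \models \varepsilon$ iff for every substitution $\theta$, if
$\theta(\bar{x}_i) \in T$ for each i, $1 \le i \le m$, then $\theta(p)$,
$\theta(q)$ are not two distinct constants and T is $\sim$-equivalent to the tableau
obtained from T by identifying $\theta(p)$ and $\theta(q)$.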

The tableaux $U_1$, $U_2$, and $U_3$ in Fig. 1 are consistent according to Def. 4.
On the other hand, both $F_1$ and $F_3$ falsify $\tau_1$, and $F_2$ falsifies
$\varepsilon_1$. Nonetheless, $F_1$, $F_2$, and $F_3$ are subconsistent because
they are homomorphic to the consistent tableaux $U_1$, $U_2$, and $U_3$,
respectively.

Using homomorphisms instead of subset relationships naturally leads to the 
following generalization of deletion-based repairing. 

Definition 5. Let $\Sigma$ be a set of constraints. A downrepair of I and $\Sigma$
is a maximal (w.r.t. $\succeq$) consistent tableau D satisfying $I \sqsupseteq D$.

In the running example, two downrepairs $D_1$ and $D_2$ are as follows:

D1  Prod  Flavor  City  Rate     D2  Prod  Flavor  City  Rate
    soap  x       y     fast         soap  mint    y     z



It can be verified that $D_1$ and $D_2$ are consistent, but that this would no
longer be the case if one of x, y, or z were replaced by “mint,” “LA,” or “fast,”
respectively. Moreover, every other downrepair is equal to $D_1$ or $D_2$ up to a
renaming of variables. Intuitively, both downrepairs assume an error in column
City: the faulty value “LA” is replaced by y. In addition, $D_1$ assumes an error
in column Flavor (x instead of “mint”), and $D_2$ in column Rate (z instead of
“fast”). Both downrepairs believe that soap is somehow produced.

[Figure 2: diagram ordering the tableaux of the running example, with layers
labeled “subconsistent fixes,” “consistent downrepairs,” and “deletion-based
repair.”]

Fig. 2. Tableaux of the running example ordered by $\succeq$. Each of $U_1$,
$U_2$, $U_3$ is consistent.

Note that the tableau $E = D_1 \cup D_2$ shown earlier is also consistent. However,
E is not a downrepair because $I \not\sqsupseteq E$. The requirement that a
downrepair be one-one homomorphic ($\sqsupseteq$) to the original relation
guarantees a one-one relationship between original and repaired tuples. Moreover,
since the cardinality of each downrepair is bounded by the cardinality of the
original relation, there can be only finitely many nonequivalent downrepairs of a
given relation I.

A further refinement consists in taking care of extensions of the original 
relation, which gives us the notion of fix: 

Definition 6. Let $\Sigma$ be a set of constraints. A fix of I and $\Sigma$ is a
maximal (w.r.t. $\succeq$) subconsistent tableau F satisfying $I \sqsupseteq F$.

The difference with downrepairs is that downrepairs need to be consistent, while 
fixes need only be subconsistent. Intuitively, each fix retains a maximal amount 
of data from the original database under the restriction that the retained data 
be reconcilable with the integrity constraints. 

In the running example, $F_1$, $F_2$, and $F_3$ are three fixes, and all other fixes
are equal to one of these three up to a renaming of variables. Intuitively, $F_1$,
$F_2$, and $F_3$ assume errors in columns Flavor, City, and Rate, respectively.

Figure 2 shows the different tableaux of the running example ordered by $\succeq$
(one edge corresponding to $U_3 \succeq U_2$ has been deliberately omitted because
it is nonessential). Intuitively, the higher a tableau in this lattice, the more
information it contains. Obviously, each downrepair must necessarily be homomorphic
to some fix. Also, since $I \supseteq D$ implies $I \sqsupseteq D$, every repair
obtained by simply deleting tuples is homomorphic to some downrepair; on the other
hand, as pointed out by Chomicki and Marcinkowski (unpublished), a downrepair may
have no deletion-based repair homomorphic to it. We will focus on fixing
hereinafter. Unfortunately, fixing in its purest form leads to intractability (see
Sect. 3). We will therefore need to impose further restrictions on fixes in order
to gain tractability.
$$\varepsilon_5 : \forall x, y, z \,\bigl(P(x, y) \wedge P(x, z) \Rightarrow y = z\bigr)$$

I   Name  Sal
    Ed    195
    Ed    205
    An    195

F4  Name  Sal      F5  Name  Sal      F6  Name  Sal
    Ed    195          x     195          Ed    y
    x     205          Ed    205          Ed    205
    An    195          An    195          An    y

F5' Name  Sal
    Ed    205
    An    195

Fig. 3. Example shedding light on the difference between fixes and i_fixes.



3 I_fixes and Uprepairs 

Let $\Sigma$ be a set of constraints. Fix checking is the following problem: on
input of a relation I and tableau F, decide whether F is a fix of I and $\Sigma$.
The corresponding problem for deletion/insertion-based repairing has been studied
in detail in [4, 6]. All complexity results in this paper refer to data complexity,
i.e. the set of constraints and the arity are fixed, and the complexity is in terms
of the cardinality of the input tableaux. Unfortunately, fix checking is
intractable, even for very simple constraints.

Theorem 1. Let

$$\varepsilon_3 : \forall x, y \,\bigl(P(x, y) \Rightarrow x = a\bigr)$$

$$\varepsilon_4 : \forall x, y \,\bigl(P(x, y) \Rightarrow y = a\bigr)$$

Given a relation I and a tableau F (both of arity 2), it is NP-hard to decide
whether F is a fix of I and $\{\varepsilon_3, \varepsilon_4\}$.

Proof (crux). Reduction from GRAPH-3-COLORABILITY [7]. □

Consistent query answering (see Sect. 4) can be expected to be at least as hard 
as fix checking. Hence, the NP-hardness result of Theorem 1 means that the 
fix construct, though semantically clean, is unlikely to be practical for querying 
purposes. It can be shown that downrepair checking is also NP-hard. 

Fortunately, fix checking becomes tractable if we require that no variable 
occurs more than once in a fix. Interestingly, this requirement is often quite 
natural and even desirable, as illustrated by the following example. The relation
I of Fig. 3 stores employee salaries, and the constraint $\varepsilon_5$ expresses
that no employee has more than one salary. Clearly, I is inconsistent; three fixes
are $F_4$, $F_5$, and $F_6$. Note that these fixes are not comparable by
$\succeq$, and that $F_5$ is equivalent to the smaller relation $F_5'$.

The fix $F_6$ contains two occurrences of the same variable y. In this way, $F_6$
preserves the information that two tuples of I record the same salary for Ed and
An. In addition, the second tuple of $F_6$ states that Ed earns 205. The conclusion
should be that An earns 205 as well. In fact, the following relation $U_6$ is the
smallest (w.r.t. $\succeq$) consistent tableau satisfying $U_6 \succeq F_6$.
Practically, $U_6$ can be obtained by a chase [8] of $F_6$ by $\{\varepsilon_5\}$.



U6  Name  Sal
    Ed    205
    An    205



However, repairing I into $U_6$ may be counterintuitive: looking at I, it seems
weird to believe that An’s salary is not 195. Intuitively, the fact that Ed and An
earn the same salary in I is just a coincidence which should not be taken too far.

From now on, we will require that no variable occurs more than once in a
fix. The practical motivation is the “independence-of-errors” thesis: if multiple
occurrences of the same value are erroneous, then this should be considered a
coincidence not to be taken into account during repairing. With this thesis, only
$F_4$ and $F_5$ would be legal ways of fixing.

I_tableaux and i_fixes differ from tableaux and fixes in that they cannot
contain multiple occurrences of the same variable. The prefix “i” may be read
as “independence.”

Definition 7. An i_tableau is a tableau T in which no variable occurs more
than once.

Let I be a relation and $\Sigma$ a set of constraints. An i_fix of I and $\Sigma$
is a maximal (w.r.t. $\succeq$) subconsistent i_tableau F such that
$I \sqsupseteq F$.

We know that fix checking is intractable (Theorem 1). Theorem 2 concerns the
tractability of i_fix checking: given a fixed set $\Sigma$ of constraints, on input
of a relation I and an i_tableau F, decide whether F is an i_fix of I and
$\Sigma$. The result applies to all constraints for which tableau subconsistency
can be checked in polynomial time.

Definition 8. A set $\Sigma$ of constraints is said to be in SUBCON$^{P}$ iff for
every tableau F, it can be decided in polynomial time in |F| whether F is
subconsistent.

Proposition 1. Every finite set of full dependencies is in SUBCON$^{P}$.

Theorem 2. Let $\Sigma$ be a set of constraints in SUBCON$^{P}$. For every relation
I and i_tableau F, deciding whether F is an i_fix of I and $\Sigma$ is in PTIME.

The next natural step is to pass from subconsistent i_fixes to consistent tableaux,
called uprepairs. The prefix “up” is used to emphasize the difference with down- 
repairs (see Def. 5) and with other repair notions in the literature. 
Definition 9. Let I be a relation and $\Sigma$ a set of constraints. An uprepair of
I and $\Sigma$ is a minimal (w.r.t. $\succeq$) consistent tableau U such that
$U \succeq F$ for some i_fix F of I and $\Sigma$.

If $\Sigma$ is a set of full dependencies and F an i_fix, then the uprepair
corresponding to F is unique up to $\sim$ and can be computed in polynomial time in
|F| by an algorithm known as the chase [8].

The fixes $F_1$, $F_2$, and $F_3$ of Fig. 1 are also i_fixes since no variable
occurs more than once in each of them; the corresponding uprepairs are $U_1$,
$U_2$, and $U_3$, respectively.
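
For the running example, the step from an i_fix to its uprepair can be sketched as a chase specialized to $\tau_1$, $\varepsilon_1$ and $\varepsilon_2$. This is an illustrative sketch, not the general algorithm of [8]: the three rules are hard-coded, variables are assumed to be strings starting with "?", and all function names are hypothetical.

```python
def is_var(term):
    return term.startswith("?")


def substitute(tableau, old, new):
    return {tuple(new if t == old else t for t in row) for row in tableau}


def equate(tableau, a, b):
    """Identify terms a and b; two distinct constants make the chase fail."""
    if is_var(a):
        return substitute(tableau, a, b)
    if is_var(b):
        return substitute(tableau, b, a)
    raise ValueError(f"chase fails: cannot equate constants {a!r} and {b!r}")


def chase_step(T):
    """Apply one violated rule and return the new tableau, or None at fixpoint."""
    for (w, x, y, z) in T:
        if y == "LA" and (w, x, "NY", "fast") not in T:   # tau_1
            return T | {(w, x, "NY", "fast")}
        if x == "mint" and z == "fast" and y != "NY":      # eps_1
            return equate(T, y, "NY")
    for (w1, x1, y1, z1) in T:                             # eps_2
        for (w2, x2, y2, z2) in T:
            if (w1, x1, y1) == (w2, x2, y2) and z1 != z2:
                return equate(T, z1, z2)
    return None


def chase(F):
    T = set(F)
    while (nxt := chase_step(T)) is not None:
        T = nxt
    return T


F1 = {("soap", "?x", "LA", "fast")}
F2 = {("soap", "mint", "?y", "fast")}
F3 = {("soap", "mint", "LA", "?z")}
```

Chasing $F_1$, $F_2$ and $F_3$ in this way yields the tableaux $U_1$, $U_2$ and $U_3$ of Fig. 1, whereas chasing the original relation I fails because the constants LA and NY would have to be equated.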



4 Consistent Query Answers and Consistent Views

Our notion of uprepair is homomorphism-based rather than subset-based. The 
advantage is that homomorphisms allow us to “hide,” through the use of vari- 
ables, erroneous values in a tuple while preserving the consistent ones. We now 
provide a homomorphism-based notion of consistent query answer, and discuss 
its advantages over the commonly used intersection-based definition. 



4.1 Infimum 

Given a set S of tableaux, one can construct a unique (up to ∼) maximal (w.r.t.
⊒) tableau T that is homomorphic to each tableau of S [2,9].

Definition 10. Let S be a finite set of tableaux. An infimum of S is a tableau 
T satisfying: 

1. S′ ⊒ T for each S′ ∈ S, and

2. for each tableau T′, if S′ ⊒ T′ for each S′ ∈ S, then T ⊒ T′.

Any set S of tableaux has an infimum, which is necessarily unique up to ∼,
i.e. the set of infimums of S is an equivalence class of ∼. We assume there
is an arbitrary selection rule that picks a representative of this ∼-equivalence
class and denote it “inf S”. The results hereafter do not depend on the actual
representative chosen.
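
The order used in Def. 10 can be checked mechanically on small inputs. The following is a minimal sketch, assuming that S ⊒ T holds exactly when some substitution of the variables of T maps every tuple of T into S, as the surrounding text suggests. The class `Var`, the function names, and the sample tableaux are illustrative assumptions, not notation from the paper, and the search is a naive backtracking procedure rather than an efficient algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A tableau variable (hypothetical representation); other symbols are constants."""
    name: str


def extend(theta, row, target):
    """Return theta extended so that theta(row) == target, or None if impossible."""
    new = dict(theta)
    for sym, val in zip(row, target):
        if isinstance(sym, Var):
            # A variable may be mapped to any symbol, but consistently.
            if new.setdefault(sym, val) != val:
                return None
        elif sym != val:
            # Constants must be mapped to themselves.
            return None
    return new


def dominates(s, t):
    """True iff S ⊒ T under the assumption above: T is homomorphic to S."""
    rows = list(t)

    def search(i, theta):
        if i == len(rows):
            return True
        for target in s:
            new = extend(theta, rows[i], target)
            if new is not None and search(i + 1, new):
                return True
        return False

    return search(0, {})


# Hypothetical tableaux over (Prod, Flavor, City, Rate); not those of Fig. 1.
lower = {("soap", Var("x"), "NY", "fast")}
upper = {("soap", "lemon", "NY", "fast"), ("oil", "olive", "LA", "slow")}
print(dominates(upper, lower), dominates(lower, upper))
```

A candidate infimum T of a set of tableaux can be sanity-checked against condition 1 of Def. 10 by calling `dominates(S, T)` for every member S of the set.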

In what follows, let U = {U₁, U₂, U₃}, the set of uprepairs of our running
example (see Fig. 1). The infimum of U is the following tableau:

inf U   Prod   Flavor   City   Rate
        soap   x        NY     fast


Intuitively, the information common to each uprepair is that soap is produced
in NY at a fast rate.



4.2 New Notion of Consistent Answer

In what follows, we focus on conjunctive queries expressed as rules [3], staying 
within a unirelational database perspective. 

Definition 11. A conjunctive query is a rule

$$Q : \langle x_0 \rangle \leftarrow P(x_1), \ldots, P(x_m)$$

where every variable occurring in x₀ also occurs at the right-hand side of ←. The
answer to Q on input of a tableau T, denoted Q(T), is the smallest tableau
containing the (not necessarily ground) tuple t if there exists a substitution θ
such that θ(x₀) = t and θ(xᵢ) ∈ T for each i, 1 ≤ i ≤ m. The arity of the
predicate symbol P is called the input arity of Q; the arity of x₀ is the output
arity.

The query answer Q(T) can contain variables that occur in T; it generally suffices
to know T, and hence Q(T), up to a variable renaming.
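
Def. 11 can be read operationally as a naive enumeration of substitutions. The sketch below illustrates that reading only. The classes `QVar` (query variable) and `Var` (tableau variable), the function `answer`, and the sample tableau are assumptions made for the example. Tableau variables are treated as opaque values, so they can end up in the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable occurring in a tableau (hypothetical representation)."""
    name: str


@dataclass(frozen=True)
class QVar:
    """A variable occurring in the query rule (hypothetical representation)."""
    name: str


def answer(head, body, tableau):
    """Q(T): all theta(head) such that theta maps every body tuple into the tableau."""
    thetas = [{}]
    for atom in body:
        extended = []
        for theta in thetas:
            for row in tableau:
                new = dict(theta)
                ok = True
                for term, value in zip(atom, row):
                    if isinstance(term, QVar):
                        ok = new.setdefault(term, value) == value
                    else:
                        ok = term == value  # constants must match exactly
                    if not ok:
                        break
                if ok:
                    extended.append(new)
        thetas = extended
    return {tuple(t[x] if isinstance(x, QVar) else x for x in head) for t in thetas}


# Hypothetical tableau over (Prod, Flavor, City, Rate).
T = {("soap", Var("x"), "NY", "fast"), ("oil", "olive", "LA", "slow")}
w, x, z = QVar("w"), QVar("x"), QVar("z")
print(answer((w, x), [(w, x, "NY", z)], T))
```

On this input the only matching tuple is the soap tuple, so the answer contains one tuple whose second component is the tableau variable, which illustrates why Q(T) need not be ground.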

A database repair need not be unique, so we can assume a set S of repairs. 
Accordingly, consistent query answering extends query semantics from a single 
tableau T to a set S of tableaux. In the following definitions, one may think 
of the set S as the set of all repairs (uprepairs in our framework). The general 
practice for answering a query Q is to return the intersection of the query answers
on each repair, denoted Q^∩. We also propose an alternative semantics, denoted
Q^inf, which returns the infimum of the query answers on each repair.

Definition 12. Let S be a finite set of tableaux and Q a conjunctive query. We 
define intersection-based and infimum-based query semantics, as follows: 

$$Q^{\cap}(S) := \bigcap \{\mathrm{ground}(Q(T)) \mid T \in S\}$$

$$Q^{\inf}(S) := \inf \{Q(T) \mid T \in S\}$$

Note that Q^∩ semantics first applies ground(·) (see Def. 1) to remove tuples with
variables. Alternatively, we could assume w.l.o.g. that no two distinct tableaux
of S have a variable in common, which makes the use of ground(·) superfluous.
To determine Q^∩(S) or Q^inf(S), it suffices to know the tableaux of S up to
∼-equivalence:

Proposition 2. Let S₁ and S₂ be two finite sets of tableaux such that for each
tableau in either set, there exists an equivalent tableau in the other set. For each
conjunctive query Q, Q^∩(S₁) = Q^∩(S₂) and Q^inf(S₁) ∼ Q^inf(S₂).

Consequently, to determine consistent query answers in our framework, it suffices
that S is a finite set containing, for each uprepair U, a tableau equivalent to U.
The following proposition expresses that Q^inf always provides more informative
answers than Q^∩:

Proposition 3. Let S be a finite set of tableaux. For each conjunctive query Q,

$$Q^{\inf}(S) \sqsupseteq Q^{\cap}(S).$$

Assume a query V asking which products can be manufactured in which flavors
in NY:

$$V : \langle w, x \rangle \leftarrow P(w, x, \mathrm{NY}, z)$$

For the set U = {U₁, U₂, U₃} of uprepairs, we obtain:

$$V^{\cap}(U) = \{\} \quad \text{and} \quad V^{\inf}(U) = \{(\mathrm{soap}, x)\}.$$

That is, the consistent answer under infimum-based semantics is that soap is
produced in some (unknown) flavor. This is more informative than the empty
answer obtained under intersection-based semantics.
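
The difference between the two semantics can be reproduced on toy data. The sketch below takes the per-repair query answers as given and combines them in both ways. It assumes that an infimum can be obtained by a position-wise direct product of tuples (a shared constant is kept, any other pair of symbols becomes a fresh variable). This is a standard construction for homomorphism orders and is used here as an assumption, not necessarily the construction of [2,9]. The answers listed are hypothetical and are not derived from Fig. 1.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Hashable


@dataclass(frozen=True)
class Var:
    """A tableau variable (hypothetical representation); other symbols are constants."""
    name: Hashable


def ground(tableau):
    """Keep only the tuples without variables."""
    return {row for row in tableau if not any(isinstance(s, Var) for s in row)}


def inf2(t1, t2):
    """Direct product of two tableaux; assumed to be an infimum up to equivalence."""
    def pair(a, b):
        # A constant survives only if both tuples agree on it.
        return a if a == b and not isinstance(a, Var) else Var((a, b))
    return {tuple(pair(a, b) for a, b in zip(r1, r2)) for r1 in t1 for r2 in t2}


def q_cap(answers):
    """Intersection-based semantics over a non-empty list of answers Q(T)."""
    return set.intersection(*[ground(a) for a in answers])


def q_inf(answers):
    """Infimum-based semantics over a non-empty list of answers Q(T)."""
    return reduce(inf2, answers)


# Hypothetical answers V(U1), V(U2), V(U3) over (Prod, Flavor).
answers = [{("soap", "lemon")}, {("soap", "rose")}, {("soap", Var("y"))}]
print(q_cap(answers))   # no tuple is shared by all three answers
print(q_inf(answers))   # one tuple: soap with a variable as flavor
```

The product may contain redundant tuples when the inputs have several tuples each; it is then equivalent to, but not necessarily as small as, other representatives of the same ∼-equivalence class.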



4.3 Consistent Views 

We saw that one advantage of Q^inf over Q^∩ is that it always yields more
informative answers (Prop. 3). We now show a second advantage: Q^inf semantics
supports consistent views, while Q^∩ does not.

An appealing property of the relational model, known as closure, is that the 
result of a query on a database is a new relation that can be queried further. 
This allows, among others, for the construct of view. A view on a database I is
specified by a view name and a view definition; the view definition is some query
V on I. If this view is queried further by a query Q, then the answer returned
should be Q(V(I)). The “intermediate” result V(I) need not be materialized
because the same answer can be obtained by executing Q∘V(I), where Q∘V is a
single new query that is the composition of Q and V. Computing the composition
is also known as “view substitution.” The closure of conjunctive queries under 
composition [3, Theorem 4.3.3] motivates the following definition: 

Definition 13. Let Q and V be conjunctive queries such that the output arity
of V is equal to the input arity of Q. We write Q∘V for the conjunctive query
satisfying for each tableau T, Q∘V(T) = Q(V(T)).

Importantly, under Q^inf semantics, conjunctive queries remain closed under
composition:

Theorem 3. Let S be a finite set of tableaux. Let Q, V be conjunctive queries
such that the output arity of V is equal to the input arity of Q.

Then, (Q∘V)^inf(S) ∼ Q(V^inf(S)).

Proof. Crux. It takes some effort to show that inf and conjunctive queries
commute, i.e. Q^inf(S) ∼ Q(inf S) for any conjunctive query Q. Then, (Q∘V)^inf(S) ∼
Q∘V(inf S) = Q(V(inf S)) ∼ Q(V^inf(S)).

For example, let Q be the following query on the output of V defined in Sect. 4.2:

$$Q : \langle w \rangle \leftarrow V(w, x)$$

Obviously, the composition of V followed by Q asks for products manufactured
in NY:

$$Q \circ V : \langle w \rangle \leftarrow P(w, x, \mathrm{NY}, z)$$

We have (Q∘V)^inf(U) = Q(V^inf(U)) = {(soap)}.

To conclude, the consistent answer to Q∘V under infimum-based semantics
can be obtained by executing Q on the consistent answer to V. It is easy to see
that this nice property does not hold for intersection-based semantics. Indeed,

$$(Q \circ V)^{\cap}(U) = \{(\mathrm{soap})\},$$

but since V^∩(U) is empty (see Sect. 4.2), the answer {(soap)} cannot be obtained
by issuing Q (or any other query) on V^∩(U). Consequently, intersection-based
semantics does not support consistent views.
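
The contrast can be written out side by side, using only the values reported in Sects. 4.2 and 4.3 and the fact that a conjunctive query returns the empty tableau on an empty input:

```latex
\begin{align*}
V^{\inf}(U) &= \{(\mathrm{soap}, x)\}
  &\Longrightarrow\quad Q\bigl(V^{\inf}(U)\bigr) &= \{(\mathrm{soap})\} = (Q \circ V)^{\inf}(U), \\
V^{\cap}(U) &= \{\}
  &\Longrightarrow\quad Q\bigl(V^{\cap}(U)\bigr) &= \{\} \neq \{(\mathrm{soap})\} = (Q \circ V)^{\cap}(U).
\end{align*}
```

In the first line, evaluating Q on the materialized view gives the same result as evaluating the composed query; in the second line the information about soap is already lost in the view and no query can recover it.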



5 Discussion and Related Work 
5.1 Related Work 

Although theoretical approaches to reasoning with inconsistent information date
back to the 1980s, the distinction between “consistent” and “inconsistent” answers
to queries on databases that violate integrity constraints seems to be due to
Bry [10], who founded the idea on provability in minimal logic. Consistent query
answering gained momentum with the advent of a model-theoretic construct of
repair [1]. Different logic programming paradigms [11,12,13] have been used to
characterize database repairs and consistent query answers. Other formalisms
used to this end include analytic tableaux [14] (unrelated to our tableaux)
and annotated predicate calculus [15]. Given a consistent deductive database and
an update that may render the database inconsistent, Mayol and Teniente [16] 
define minimal insertions, deletions, and modifications for restoring the database 
to a consistent state. They mention that these integrity maintenance techniques 
can be adapted to repair an inconsistent database. 

All approaches cited so far (including ours) assume a single database con- 
text. The data integration setting of [17,18] considers a global database which 
is retrieved by queries over local source databases. Inconsistency can arise rela- 
tive to integrity constraints expressed over the global schema. Lembo et al. [17] 
study the computation of consistent answers to queries over the global schema 
when the constraints are key and inclusion dependencies. Complexity results of 
consistent query answering under key and inclusion dependencies appear in [5]. 

Consistent query answering also appears in the context of data integration 
in [19,20]. From [19], it follows that under certain acyclicity conditions, sets of 
not-necessarily-full tuple-generating dependencies are in SUBCON^P and hence
are covered by our results. This observation is important because not-necessarily-
full tuple-generating dependencies are needed to capture foreign key constraints
in general.

We have noticed (e.g. in Prop. 2) that it is often a waste of time to examine
more than one representative of a ∼-equivalence class. For space efficiency
reasons, we may be interested in finding the representative of the smallest size;
finding minimal (w.r.t. size) representatives within a ∼-equivalence class
subject to certain conditions is at the center of [20].

Recent comprehensive overviews of database repairing and consistent query 
answering appear in [6,21]. 



5.2 Concluding Remarks 

Let I be a database that is replaced by a consistent database J (possibly I = J
if I is consistent). All repair notions that have appeared in the literature [1,5,11,
13,14,15,17,22] measure the distance between a database and its repairs in terms
of the tuples inserted (J \ I) and deleted (I \ J) during repairing. Insertions and
deletions can be treated symmetrically [1,12,13] or asymmetrically [5]. In [4],
only tuple deletions are possible as repair primitive.

In [2], we first proposed to use homomorphisms instead of subset relationships
for comparing a database and its repairs. This allows rectifying a value within a 
tuple while preserving other correct values within the tuple. This is significant as 
many operational databases contain “long” tuples, and errors seldom affect the 
entire tuple. In general, our approach cannot be simulated by deletion/insertion- 
based repairing. 

We have shown (Theorem 1) that for the construct of update-based repairing
proposed in [2], it is already NP-hard (data complexity) to decide whether a
given tableau F is a fix of a relation I and a set Σ of equality-generating full
dependencies. Fortunately, we were able to show (Theorem 2) that the problem
is in PTIME if no variable occurs more than once in F, i.e. if F is an i_fix. This
restriction to i_fixes captures the quite natural “independence-of-errors”
assumption.

By requiring that i_fixes be maximal (w.r.t. ⊒), we retain as much information
as possible from the original database; only data that cannot possibly be
reconciled with the integrity constraints is removed. This is similar in nature 
to the “loosely-sound” semantics proposed in [5] for deletion/insertion-based re- 
pairing. A consistent minimal extension of an i_fix is an uprepair. Uprepairs, in 
general, preserve more consistent information than other repair constructs in the 
literature. 

Nearly all research on consistent query answering has focused on (the com- 
plexity of) computing consistent answers. So far, little attention has been paid 
to revisiting existing database theory under this new query semantics. In this 
respect, we showed that to maintain the closure of conjunctive queries under 
composition (and hence the principle of view substitution), consistent query
answers should be infimum-based rather than intersection-based (see Theorem 3).
Moreover, we showed (Prop. 3) that the newly proposed semantics always yields
more informative answers.



Abstract. Many requirements for a business process depend on workflow
execution data, which includes data common to the whole population of processes,
the state of resources, the state of processes, etc. The natural way to specify and
implement such requirements is to put them into the process definition. To do
so, we need: (1) a generalised workflow metamodel that includes data on the
workflow environment, process definitions, and process execution; (2) a powerful
and flexible query language addressing the metamodel; (3) integration of the
query language with a business process definition language. In this paper we
present such a workflow metamodel together with the business process query
language BPQL. BPQL is integrated with the XML Process Definition
Language (XPDL), significantly increasing its expressiveness and flexibility.
We also present practical results of applying the proposed language in the
OfficeObjects® WorkFlow system.



1 Introduction 

During the last decade workflow management systems (WfM systems) have become
very successful. They have been used to implement various types of business
processes. Despite the many advantages of applying such systems, significant
limitations have also been observed. One of the major restrictions was the
assumption that business processes do not change too often during their execution.
While this assumption may hold for the majority of production processes, it does
not hold for less rigid processes, such as administrative ones. Because of their
nature, the latter need to adapt to frequent changes in the workflow environment
(i.e. resources, data and applications) as well as in the workflow itself (e.g. the
current workload of participants) [21], [1], [23].

An approach to increasing process adaptability is to make process definitions more
flexible. In this context, flexible means that it is possible to express complex dynamic
requirements that depend on the process execution history as well as on current
organisational and application data (further referred to as relevant data). An
alternative is manual control of processes and their resources at run time, and this
may be the only way for complex and unpredictable cases. However, for many quite
complex requirements, such as workflow participant assignments and transition
conditions, the flexible-definition approach seems to be the closest to the real
business process behaviour.

In order to express the mentioned requirements, we need to define an appropriate
process metamodel, then to develop a language to query this model, and finally to
integrate this language with a process definition language. The process metamodel
has to be generic and include both process definition and process execution
entities. So far, there is at least one widely known process definition metamodel,
proposed in the standard [24] by the Workflow Management Coalition (WfMC). While
this metamodel is well defined, no standard process execution metamodel is
provided either by WfMC standards or by other process management bodies (e.g.
BPMI). In addition, process execution models provided by WfM systems seem to be
tool-oriented and mainly focused on entities implemented within a given system.
Recently, there has been some effort to define such a generalised process execution
model (e.g. [11]). However, it does not include some process features proposed in
recent workflow research, such as advanced time management [9] or flexible workflow
participant assignment [14].

In order to make process definitions more flexible, in the next step we need to
develop a language to query the mentioned metamodel. This language should be able to
express all possible queries on the metamodel, and should be readable and clear for
process designers, who are not necessarily software programmers.

In Section 2 we propose a generalised process metamodel as an extension of the
WfMC’s metamodel; the extension concerns entities related to process execution. In
Section 3, on the basis of this metamodel, we define the Business Process Query
Language (BPQL). First we specify the process definition elements where this language
may be used, and then define its syntax and semantics, followed by some aspects of its
pragmatics. In Section 4 we present the integration of BPQL with the XML Process
Definition Language (XPDL). In Section 5 we present practical results of applying the
language in a commercial workflow management system, OfficeObjects® WorkFlow. In
Section 6 we discuss related work. Section 7 concludes.



2 Process Metamodel 

In order to specify a workflow query language, an appropriate workflow process
metamodel should first be defined. It should represent two parts of the ‘workflow
process puzzle’: process definition and process execution. The former part is mainly
used by workflow engines to execute workflow processes, while the latter helps in
monitoring and analysing workflow process execution.

Since WfM systems are only a part of IT applications, there is a need to specify
requirements for the workflow environment. From the workflow point of view such
systems have three dimensions: processed data, provided services and registered
resources that may execute the services operating on the data. A part of these data is
used by WfM systems to control execution of workflow processes (i.e. flow
conditions, and workflow participant assignment). WfM systems have rights only to read
these data, not to modify them. In WfMC terms these data are referred to as workflow
relevant data. Services provided by information systems may be used to express
process activities. During execution of activities WfM systems call these services with
appropriate parameters. In the WfMC terminology these services are called applications.
There are also resources that include users or automatic agents that may perform some
activities within workflow processes.

Fig. 1. Workflow system as a part of IT application

Resources may also be selected using roles, groups or organizational units. A
resource that may participate in process execution is called a workflow participant. In
addition to the mentioned elements WfM systems use workflow control data. These
data are managed only by WfM systems and store workflow-specific information such
as the number of active process instances, international settings, etc.

The workflow process metamodel defines workflow entities, their relationships and 
basic attributes. It consists of two parts, namely a workflow process definition meta- 
model and a workflow process instance metamodel. The former part defines the top- 
level entities contained within process definition, their relationships and basic attrib- 
utes. It is used to design and implement a computer representation of business proc- 
esses. The latter part defines the top-level entities contained within process instantia- 
tion, their relationships and basic attributes. This metamodel is used to represent 
process execution that is done according to the process definition model. The work- 
flow process metamodel also shows how individual process definition entities are 
instantiated during process execution. 

The main entity of the definition metamodel is process definition. It provides basic
information about the computer representation of a business process. For every
process a set of data container attributes is defined. These attributes are used during
process execution in the evaluation of conditional expressions, such as transition
conditions or pre- and post-conditions. The set of container attributes (i.e. number,
types, and names) depends on individual process definitions.

Fig. 2. Process metamodel

Process definition consists of activities. An activity defines a piece of work that
forms one logical step within a process. There are three types of activities: atomic,
route and compound ones. An atomic activity is the smallest logical step within the
process and may not be further divided. For every atomic activity it is possible to
define who will perform it, how it will be executed and what data will be processed.
An atomic activity may be performed by one or more workflow participants. Basically,
a workflow participant is a user (or a group of users), a role or an organizational unit.
In addition, the system can be treated as a special workflow participant. The
specification of workflow participants that may perform a given activity is called
workflow participant assignment. The way an activity will be performed is specified
by the application that is executed on its behalf.

Such a specification also includes a set of parameters that will be passed to the
application. Since the mentioned application operates on data, the object types that
will be processed (i.e. created, modified, read or deleted) by this activity also have to
be defined. Object types may be considered as workflow relevant data, that is, the part
of application data related to and processed by workflow processes.

The second type of activity is a compound activity. This type of activity represents
a sub-process and is usually defined to simplify the process definition and to make use
of common activities that exist in many business processes. The last type of activity is
a route activity. It is used to express control flow elements, namely split and join
operations. For both split and join operations two basic control flow operators are
defined: AND and XOR. On the basis of these operators it is possible to express more
complex flow operations. A route activity is performed by the system. Since this type
of activity is a skeletal one, it performs no work processing, and neither object types
nor an application is associated with it.

The order of activities within the process is defined by transitions. A transition
defines the relation between two activities. A transition from one activity to another
may be conditional (involving expressions which are evaluated to permit or inhibit the
transition) or unconditional.

When a workflow process has been defined, it may be executed many times. Exe- 
cution of a workflow process according to its definition is called a process enactment. 
The context of such an enactment includes real performers (workflow participants), 
concrete relevant data, and specific application call parameters. 

The representation of a single enactment of a workflow process is called a process 
instance. Its behavior is expressed by states and described as a state chart diagram. 
The history of states for a given process instance is represented by process instance 
state entities. Every process instance has its own data container. This container is an 
instantiation of a data container type defined for workflow process and includes con- 
tainer attributes that are used to control process execution (e.g. in flow conditions). 

Execution of a process instance may be considered as execution of a set of activity
instances. Every activity instance that is an atomic activity is performed by one
workflow participant. If more than one participant is assigned to the activity, then this
activity is instantiated as a set of activity instances with one performer for each of
them. Such an activity instance may be executed as an application called with specific
parameters and operating on data objects that are instances of the data types assigned
to the activity during process definition.
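
The instantiation rule for atomic activities can be illustrated with a small data-structure sketch. The class and field names below (`Activity`, `ActivityInstance`, `participants`, `state`) and the initial state value are assumptions chosen for the example; they are not the entity or attribute names of the metamodel in Fig. 2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Activity:
    """Definition-side entity: an atomic activity (hypothetical fields)."""
    name: str
    participants: List[str] = field(default_factory=list)  # resolved assignment
    application: Optional[str] = None                      # executed on its behalf


@dataclass
class ActivityInstance:
    """Instance-side entity: one performer per instance."""
    activity: Activity
    performer: str
    state: str = "created"  # hypothetical initial state


def instantiate(activity: Activity) -> List[ActivityInstance]:
    """Create one activity instance for each participant assigned to the activity."""
    return [ActivityInstance(activity, p) for p in activity.participants]


approve = Activity("Approve order", ["anna", "ben"], application="approval_form")
for inst in instantiate(approve):
    print(inst.performer, inst.state)
```

Keeping exactly one performer per instance means that the state history of each participant's work can be recorded separately, in line with the activity instance state entities described in the text.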

If an activity instance is a route activity, it is performed automatically by the system
and there is neither an application nor data objects assigned to it. When an activity
instance is a sub-process, it is represented by another process instance executed
according to the definition of that sub-process.

Similarly to the process instance, the behaviour of an activity instance is represented
by a state diagram and stored as activity instance state entities.

Flow between activity instances is represented by transition instances. When an
activity instance is finished, the system checks which transitions leaving this
activity may be instantiated. If a transition has no condition or its transition
condition is satisfied, it is automatically instantiated by the system. A transition
instance may be considered as a ‘predecessor-successor’ relation between two activities.
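
The rule for instantiating transitions can be sketched as follows. This is only an illustration under stated assumptions: conditions are modelled as Python callables over the process instance's data container, and the names `Transition`, `TransitionInstance`, and `fire` are hypothetical rather than taken from the metamodel.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Transition:
    """Definition-side transition; condition None means unconditional."""
    source: str
    target: str
    condition: Optional[Callable[[Dict], bool]] = None


@dataclass
class TransitionInstance:
    """Instance-side 'predecessor-successor' pair."""
    predecessor: str
    successor: str


def fire(finished: str, transitions: List[Transition],
         container: Dict) -> List[TransitionInstance]:
    """Instantiate every transition leaving the finished activity that is enabled."""
    instances = []
    for t in transitions:
        if t.source != finished:
            continue
        if t.condition is None or t.condition(container):
            instances.append(TransitionInstance(t.source, t.target))
    return instances


transitions = [
    Transition("check", "approve", lambda c: c["amount"] <= 1000),
    Transition("check", "escalate", lambda c: c["amount"] > 1000),
    Transition("check", "log"),  # unconditional
]
print([ti.successor for ti in fire("check", transitions, {"amount": 500})])
```

Whether several enabled transitions are all followed or only one of them depends on the split operator (AND or XOR) of the route activity involved, which this sketch does not model.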



3 Business Process Query Language 

The process metamodel described in the previous section specifies basic workflow 
entities and relationships between them. In order to extract information from this 
metamodel, it is necessary to define a query language addressing business processes. 

This language, like other standard query languages, should have a clear syntax and
unambiguous semantics. It should be able to express complex queries. It also needs to
be coherent and complete, so that all possible queries on the process model can be
asked. It should be an object-oriented language rather than a relational one, since the
process metamodel is much easier to represent through objects and associations.

This language should also provide core workflow specific functionality, namely: 

• functions to simplify operations on process definition/instance graphs, especially
to retrieve information about the first activity, the current activity, its predecessors
and its possible successors;

• functions to extract information about the context in which a given query is called,
e.g. when using a function that retrieves the currently processed activity, it must
be known which process the query refers to.

It seems that these two kinds of requirements for business process query languages
may be satisfied by selecting the most suitable standard query language and extending
it with a set of process-specific features/functions.

Because of the popularity of XML, the family of XML query languages (e.g. XQuery)
comes as the first candidate for the mentioned standard language [22]. They have
quite clear syntax and a reasonable set of operators. However, these languages address
a hierarchical data structure with no cycles or many-to-many relationships. This
kind of relationship is used quite often in the process model; for instance, the
predecessor-successor association between activities or activity instances is a
many-to-many one. The next candidate is the family of SQL query languages. These
languages are very popular and widely used. They are able to operate on relational data
and query various types of relationships between entities. Despite the many advantages
of these languages, their syntax is quite complicated (see successive specifications of
SQL-89, SQL-92 [12] and SQL-99 [13]). Some constructs of SQL introduce limitations
and semantic pitfalls (e.g. the group by operator or null values) and are sometimes
criticised for their ambiguous semantics. While SQL-99 covers not only relational
but also object-relational structures, implementing it presents too big a challenge
(ca. 1500 pages of specification), clearly beyond the potential of a small software
enterprise.

The last alternative is the family of object-oriented query languages. They seem to
be the most appropriate to express various types of queries to the business process
metamodel. As far as well-known object-oriented or object-relational specifications
are concerned, it seems that there is no good solution for the requirements
specified at the beginning of this section. There are many opinions that the
best-known one, ODMG OQL [15], does not provide a clear and coherent specification
[2, 19]. Moreover, because of the lack of convincing formal semantics, query
optimisation in OQL is still an open question. Fortunately, there is one more candidate
that has recently joined the group, namely the Stack-Based Query Language (SBQL)
[17, 18, 20]. Unlike the other presented candidates, this language has simple, formal
and coherent semantics. SBQL has very powerful query optimisation methods, in
particular methods based on query rewriting, methods based on indices and methods
based on removing dead queries. Currently it has several implementations, in
particular for the European project [9], for XML repositories based on the DOM model,
and for the object-oriented DBMS Objectivity/DB. SBQL syntax (as shown in the next
sections) is very simple and seems to be easy to understand and use.



3.1 Syntax, Semantics, and Pragmatics of BPQL 

The BPQL syntax is defined as a context-free grammar specified in the EBNF
notation. The semantics of BPQL follows the semantics of SBQL and is based on the
operational method (abstract implementation machine). The pragmatics of BPQL concerns
how it can be used properly and why one would use it. Pragmatics includes some
visualisation of the process metamodel and rules for developing BPQL queries
according to the metamodel and according to the assumed business ontology. In this
section, BPQL pragmatics will be illustrated by examples showing definitions of
business processes and corresponding queries.

The fundamental concepts of BPQL are taken from the stack-based approach 
(SBA) to query languages. In SBA a query language is considered a special kind of a 
programming language. Thus, the semantics of queries is based on mechanisms well 
known from programming languages like the environment (call) stack. SBA extends 
this concept for the case of query operators, such as selection, projection/navigation, 
join, quantifiers and others. Using SBA one is able to determine precisely the opera- 
tional semantics of query languages, including relationships with object-oriented 
concepts, embedding queries into imperative constructs, and embedding queries into 
programming abstractions: procedures, functional procedures, views, methods, mod- 
ules, etc. 

SBA is defined for a general object store model. Because various object models
introduce a lot of incompatible notions, SBA assumes a family of object store
models, enumerated M0, M1, M2 and M3. The simplest is M0, which covers
relational, nested-relational and XML-oriented databases. M0 assumes hierarchical
objects with no limitations concerning nesting of objects and collections. M0 also
covers binary links (relationships) between objects. Higher-level store models introduce
classes and static inheritance (M1), object roles and dynamic inheritance (M2), and
encapsulation (M3). For these models the formal query language SBQL (Stack-Based
Query Language) is defined. SBQL is based on abstract syntax and full orthogonality
of operators, hence it follows the mathematical flavour of the relational algebra and
calculi. SBQL, together with imperative extensions and abstractions, has the
computational power of programming languages. Concrete syntax, special functionality,
special features of a store model and a concrete metamodel allow one to make from
SBQL a concrete query language, in particular BPQL.

SBA respects the naming-scoping-binding principle, which means that each name 
occurring in a query is bound to the appropriate run-time entity (an object, an
attribute, a method, a parameter, etc.) according to the scope of its name. The principle is
supported by means of the environment stack. The classical stack concept imple- 
mented in almost all programming languages is extended to cover database collec- 
tions and all typical query operators occurring e.g. in SQL and OQL. The stack also 
supports recursion and parameters: all functions, procedures, methods and views 
defined by SBA can be recursive by definition. Rigorous formal semantics implied by 
SBA creates a very high potential for query optimization. Full description of SBA and 
SBQL is presented in [20]. 
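
The role of the environment stack in name binding can be illustrated with a toy evaluator. The sketch below is a strong simplification written for this purpose, not the SBQL machinery of [20]: objects are plain dictionaries, a stack section maps names to lists of values, and only two non-algebraic operators (selection and navigation) are modelled. All identifiers and the sample data, including the `Role` attribute, are assumptions.

```python
class EnvStack:
    """Environment stack: each section maps a name to the list of values bound to it."""

    def __init__(self):
        self.sections = []

    def push(self, section):
        self.sections.append(section)

    def pop(self):
        self.sections.pop()

    def bind(self, name):
        # Search from the top: the innermost scope that knows the name wins.
        for section in reversed(self.sections):
            if name in section:
                return section[name]
        raise NameError(name)


def nested(obj):
    """Binders for the interior of an object, here the attributes of a dict."""
    return {k: (v if isinstance(v, list) else [v]) for k, v in obj.items()}


def where(envs, name, predicate):
    """Selection: evaluate the predicate once per object, inside that object's scope."""
    result = []
    for obj in envs.bind(name):
        envs.push(nested(obj))
        try:
            if predicate(envs):
                result.append(obj)
        finally:
            envs.pop()
    return result


def dot(envs, left, right):
    """Navigation: bind the right-hand name inside each object of the left-hand result."""
    out = []
    for obj in envs.bind(left):
        envs.push(nested(obj))
        try:
            out.extend(envs.bind(right))
        finally:
            envs.pop()
    return out


envs = EnvStack()
envs.push({"Performer": [{"Name": "Anna", "Role": "clerk"},
                         {"Name": "Ben", "Role": "manager"}]})
print(dot(envs, "Performer", "Name"))
print(where(envs, "Performer", lambda e: e.bind("Role") == ["manager"]))
```

Both operators follow the same pattern: push a new scope for the current object, evaluate the inner expression, and pop the scope again, which is what allows names such as `Name` to be resolved relative to the object being processed.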

Following SBQL, a BPQL query can return a simple or complex value that is con- 
structed from object identifiers, atomic values and names. The definition of a query 
result is recursive and involves basic structure and collection constructs (struct, bag, 
sequence). In particular, a query can return an atomic value (e.g. for query 2+2), a bag
of references to attributes (e.g. for query Performer.Name), a collection of named 
references to objects (e.g. for query Performer as p). BPQL is based on the modular- 
ity rule which means that semantics of a complex query is recursively composed from 
semantics of its components, up to atomic queries which are literals, object names and 
function calls. 

BPQL includes all basic non-algebraic operators (SBA term), such as quantifiers, 
selection, dependent join, projection, navigation (path expressions). There are also
many algebraic operators, among them a conditional query if q1 then q2 else q3, aliasing
(operator as), typical arithmetic/string operators and comparisons, boolean operators,
and others. A simplified syntax of BPQL is presented in the Appendix.

A BPQL query can also include a function call. A function may have arguments,
which are BPQL queries too. A function returns a result which is compatible with
query results (e.g. it can be a collection of OIDs), hence function calls can be freely
nested in BPQL queries. BPQL introduces a set of core functions, both standard ones
and workflow-specific ones. The standard functions include mathematical functions
(e.g. COS, SIN, SQRT), string functions (e.g. CONCAT, SUBSTRING), date and
time functions (e.g. CURRDATE, YEAR, MONTH), and aggregate functions (e.g.
AVG, COUNT, MAX).

So far, there are four workflow-specific functions. They were implemented in 
BPQL in advance to simplify queries. 

• FirstActInst(ProcessInstance): ActivityInstance - returns an object that represents
  the start activity for a process instance which is passed as the argument of the
  function.

• PrevActInst(ActivityInstance): ActivityInstance[] - returns a list of objects that
  represent direct predecessors of the activity passed as the argument of the
  function.

• ActInst(ProcessInstance, ActId): ActivityInstance[] - returns a list of activity
  instances that are instantiations of the activity within a process instance. Both
  the process instance object and the identifier of the activity are passed as the
  function arguments.

• NextActInst(ActivityInstance): ActivityInstance[] - returns a list of objects that
  represent direct successors of the activity passed as the argument of the function.

In addition to the above functions, two context-related functions are provided: 

• ThisProcessInst: ProcessInstance - returns the object that represents the process
  instance on behalf of which a given query is executed.

• ThisActivityInst: ActivityInstance - returns the object that represents the activity
  instance on behalf of which a given query is executed.

All the class and attribute names as well as function names and their parameters are 
stored in the internal BPQL dictionary. A couple of examples of using BPQL are 
presented below. 

Example 1 - Optional Activity 

The ‘advanced verification’ activity should be performed only if the deadline for a
given process instance is more than 2 days away and there exists an expert whose
current workload is less than 8 working hours.

(ThisProcessInst.deadline - CurrDate) > 2 and
exists (Performer as P where
    sum((P.performs.ActivityInst where status = ‘open’).duration) < 8)
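
The condition of Example 1 can be mirrored in ordinary code to make its meaning explicit. The following is a minimal Python sketch, not the BPQL implementation: the classes `ActivityInst` and `Performer`, the helper names, and the assumption that `duration` is measured in working hours are all illustrative choices that do not come from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActivityInst:
    status: str      # e.g. "open" or "closed"
    duration: float  # assumed to be working hours


@dataclass
class Performer:
    name: str
    performs: list   # ActivityInst objects assigned to this performer


def open_workload(performer):
    """sum((P.performs.ActivityInst where status = 'open').duration)"""
    return sum(a.duration for a in performer.performs if a.status == "open")


def advanced_verification_needed(deadline, today, performers):
    """True if more than 2 days remain and some performer has < 8 open hours."""
    days_left = (deadline - today).days
    return days_left > 2 and any(open_workload(p) < 8 for p in performers)
```

The `any(...)` call plays the role of the `exists` quantifier, and the generator inside `open_workload` corresponds to the `where` selection followed by the `sum` aggregate.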



Example 2 - Participant Assignment 

An activity should be executed by the performer of the first activity or, if this per- 
former has more than five delayed tasks to perform, by the performer of the previous 
activity. 
if
    count(FirstActInst(ThisProcessInst).performedBy.
        Performer.performs.ActivityInst where
        (delayed = ‘yes’ and status = ‘open’)) <= 5
then FirstActInst(ThisProcessInst).performedBy.Performer
else PrevActInst(ThisActivityInst).performedBy.Performer
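
The decision rule of Example 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tasks are modelled as plain dictionaries with `delayed` and `status` keys, and `tasks_of` is a hypothetical callback standing in for the path expression `Performer.performs.ActivityInst`.

```python
def assign_performer(first_performer, previous_performer, tasks_of):
    """Pick the performer of the first activity unless they have more than
    five delayed open tasks; otherwise fall back to the previous performer."""
    delayed_open = [
        t for t in tasks_of(first_performer)
        if t["delayed"] == "yes" and t["status"] == "open"
    ]
    return first_performer if len(delayed_open) <= 5 else previous_performer
```

The threshold comparison `<= 5` is taken directly from the query; everything else is scaffolding needed to run the rule outside a workflow engine.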



3.2 BPQL and Process Definition 

BPQL may be used to generalise process definition. It is able to express
requirements on the process definition that depend on process execution data (e.g. the
performer of this activity is a seller with the minimal work-load). As will be shown in
the next section, BPQL queries may simplify process definition by reducing the number
of defined activities which have to be ‘artificially’ introduced to cope with such
requirements.

BPQL may be used in the process definition in all the elements that operate on 
relevant or workflow control data such as transition condition, workflow participant 
assignment, pre and post-activity conditions, and event handling (condition part). 

Moreover, BPQL exposes the mentioned requirements directly in the process
definition, giving process designers more knowledge of its elements. Before, these
elements had to be hidden from the process designers and written as programming
procedures. Such a situation makes process modification harder, since very often it is
necessary to modify the code of procedures instead of modifying the process definition
itself. BPQL gives process designers the chance to change the process definition
without (or with less) interference in the program code.



3.3 How It Works - An Example 

Let us assume that there is a simplified version of a process for ordering laptops.
Every registered customer may order any number of laptops. An order made by a
customer is then accepted by a company seller who is responsible for verifying the
financial status of the customer and the ability to meet the order at the requested time.
For a bigger order, or if there is not much time for its acceptance, the order is served
by a senior seller. Otherwise, it is served by a plain seller. In addition, the order
should be served by a seller that has minimal work-load. At this stage of process
implementation ‘minimal’ means a person who has the minimal number of tasks assigned.
If the order is accepted, it is sent to the production department for completion. It is
assumed that all the orders are processed by the company.

Even in this simplified example some of the requirements for the process have to
be defined on the basis of process execution data. For instance, the ‘minimal
work-load’ requirement may only be expressed using a query on the current task
assignment. The selection of the kind of seller for accepting the order, in turn, may be
defined by a condition on process relevant data (workflow environment). Both requirements
may be easily expressed using BPQL. What is more, their definition in BPQL simplifies the
definition of the process, making it more general. Instead of defining two activities
for accepting the order, one for ‘Senior-seller’ and another for ‘Seller’, it is possible
to define only one for both mentioned workflow participants. Such an approach seems to
be more adequate to the real business process that is modelled and computerized. The
role ‘seller’ may be defined in BPQL as follows¹:



 1. (if Order.Value > 30000 or Order.DeliverDate - CurrDate < 2
 2. then
 3.   User where position = ‘Senior-Seller’ and
 4.     count(is.Performer.performs.ActivityInst where status = ‘open’)
 5.     =
 6.     min((User where position = ‘Senior-Seller’).
 7.       count(is.Performer.performs.ActivityInst where status = ‘open’))
 8. else
 9.   User where position = ‘Seller’ and
10.     count(is.Performer.performs.ActivityInst where status = ‘open’)
11.     =
12.     min((User where position = ‘Seller’).
13.       count(is.Performer.performs.ActivityInst where status = ‘open’))



Line 1 defines the condition to select either a senior seller (lines 3-7) or a seller
(lines 9-13). A senior seller that will perform a given activity is a person employed at
the position ‘Senior-Seller’ (line 3) who has the minimal work-load (line 4)
among all senior sellers (lines 6-7). The same holds for the users with the ‘Seller’
position. The condition status = ‘open’ selects only those tasks that are currently being
performed. The function count determines the number of activity instances that are
currently performed by a given user. Similarly, the function min determines the minimal
number of tasks assigned to a ‘Senior-Seller’ (lines 6-7) and a ‘Seller’ (lines 12-13).
The association is connects User objects to Performer objects, and the association
performs connects Performer objects to ActivityInst objects (note the corresponding path
expressions). Because in general the query may return several users (all with minimal
work-load), an additional function is necessary that randomly picks one of them. The
query can be optimised by using an index on the User position attribute and by factoring
out independent sub-queries (in this case both sub-queries starting from min).
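
To make the logic of the role definition concrete, the following Python sketch re-implements it over an in-memory list. It is an illustration only, under several assumptions: the `User` class with a precomputed `open_tasks` counter replaces the path expression `is.Performer.performs.ActivityInst where status = ‘open’`, and the function name `resolve_seller` is hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class User:
    name: str
    position: str    # "Senior-Seller" or "Seller"
    open_tasks: int  # number of activity instances with status 'open'


def resolve_seller(users, order_value, deliver_date, today):
    """Return all users of the required position with minimal work-load."""
    # Line 1 of the query: big or urgent orders go to senior sellers.
    urgent = (deliver_date - today).days < 2
    position = "Senior-Seller" if order_value > 30000 or urgent else "Seller"

    candidates = [u for u in users if u.position == position]
    if not candidates:
        return []
    # The independent sub-query (min) is computed once, not once per user.
    least = min(u.open_tasks for u in candidates)
    return [u for u in candidates if u.open_tasks == least]
```

Computing `least` before the final filter corresponds to factoring out the independent sub-query mentioned above. As in the BPQL version, the result may contain several users, so a caller would still need to pick one of them, for example with `random.choice`.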



4 Integration of BPQL and XPDL 

It seems that the best way of using BPQL in process definition is to integrate it with a 
well known and widely used process definition language. Nowadays, there are several 
standard process definition languages such as XML Process Definition Language 
(XPDL) [25], Business Process Modelling Language [6] or more web-oriented lan- 
guages such as Business Process Execution Language for Web Services (BPEL4WS) 
[5] and Web Service Description Language [26]. 

So far it seems that XPDL and BPEL(4WS) are the most mature and complete
process definition languages. Both these languages may be easily extended with BPQL.
This integration considerably extends the functionality of the existing language. As an
example we present in the next sub-sections how it may be done in XPDL.



¹ The object User represents an application user. The corresponding class is a part of
application resource data.
4.1 Transition Condition

In XPDL [25] a transition condition is expressed by the XML tag
Transitions/Transition/Condition with type CONDITION. However, there is no specification
of what this condition should look like. Usually it is represented as text. In this
situation, the XPDL definition may be extended by a more precise definition which requires
a condition to be written in BPQL. If a BPQL query returns one or more objects as the
result, then the condition is satisfied. Otherwise, if the result is empty, it is not. An
example written in XPDL may look like:

<Transition Id="bl" From="ChckBalance" To="ProcRequest">
  <Condition Type="CONDITION">
    Order where
      (id = ThisProcessInstance.hasDataContainer.orderId and
       (value > 30000 or quantity > 100))
  </Condition>
</Transition>
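
The rule "a non-empty query result means the condition is satisfied" can be sketched as follows. This is a hedged illustration, not part of any workflow engine: the fragment is assumed to carry no XML namespace, the query text is shortened, and `run_query` is a hypothetical callback that stands in for a BPQL evaluator and returns a list of objects.

```python
import xml.etree.ElementTree as ET

# Simplified transition; a real XPDL file would declare namespaces.
XPDL = """
<Transition Id="bl" From="ChckBalance" To="ProcRequest">
  <Condition Type="CONDITION">
    Order where (value > 30000 or quantity > 100)
  </Condition>
</Transition>
"""


def transition_condition(xpdl_fragment):
    """Extract the BPQL text of the Condition element, whitespace-normalised."""
    root = ET.fromstring(xpdl_fragment.strip())
    condition = root.find("Condition")
    return " ".join(condition.text.split())


def evaluate_transition(xpdl_fragment, run_query):
    """The condition holds iff the query returns at least one object."""
    result = run_query(transition_condition(xpdl_fragment))
    return len(result) > 0
```

Note that a query containing `<` (as in Example 1) would have to be escaped as `&lt;` or wrapped in a CDATA section to keep the XPDL document well-formed.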



4.2 Workflow Participant Assignment 

According to the WfMC’s definition [24], a Workflow Participant Assignment 
(WPA) defines the set of participants that will perform a given workflow activity. A 
participant can be one of the following types: a resource (specific resource agent), 
resource set, organisational unit, role (a function of a human within an organisation), 
human (a user) or system (an automatic agent). 

To be coherent with the above definition it is suggested to use BPQL to define a 
participant. BPQL definition could be included as an extended attribute of the partici- 
pant specification, if the participant is represented as a role. In this case, the WPA 
definition would remain the same while the participant definition would be expressed 
as a function that returns a set of participants. A BPQL query would return a set of 
workflow participants that would satisfy it. In addition, the WPA decision and the
modifier introduced in [14] could also be used to specify a workflow participant. An
example written in XPDL may look like:

<Participant Id="pl" Name="Seller">
  <ParticipantType Type="ROLE"/>
  <Description>Seller</Description>
  <ExtendedAttributes>
    <ExtendedAttribute Name="Definition">
      User where (position = ‘Seller’)
    </ExtendedAttribute>
  </ExtendedAttributes>
</Participant>



4.3 Pre and Post-activity Condition 

So far, pre and post conditions for workflow activities are not defined directly in 
XPDL. However, it is possible to use the tag ExtendedAttribute to express these con- 
ditions which, once again, would be defined in BPQL. If a BPQL query returns one or 
more objects as the result, then the condition is satisfied. Otherwise, if the result is
empty, it is not. An example written in XPDL may look like: 

<Activity Id="56" Name="Compose Acceptance Message">
  <Implementation>
    <Tool Id="composeMessage" Type="APPLICATION">
      <ActualParameters>
        <ActualParameter>status</ActualParameter>
        <ActualParameter>orderNumber</ActualParameter>
      </ActualParameters>
    </Tool>
  </Implementation>
  <ExtendedAttributes>
    <ExtendedAttribute Name="Pre-Condition">
      Order where (id = ThisProcessInstance.hasDataContainer.orderId and status = ‘closed’)
    </ExtendedAttribute>
  </ExtendedAttributes>
</Activity>



5 Practical Results 

The first version of BPQL has been implemented for workflow participant assignment and
integrated in OfficeObjects® WorkFlow (OO WorkFlow) from Rodan System. This system
has been deployed at major Polish public institutions (e.g. Ministry of Labour and
Social Policy, Ministry of Infrastructure) as well as at private companies (e.g.
Sanplast Ltd.).

XPDL, used in OO WorkFlow to represent process definitions, has recently been
extended with BPQL according to the suggestions about workflow participant assignment
presented in the previous section.

The first practical verification of BPQL was done for the system for electronic
document exchange between Poland and European Council (further referred to as EWDP).
In this system OO WorkFlow was used to implement the process for preparation of
the Polish standpoint concerning a given case that was discussed at the European
Council, COREPERS or its working groups. The process consists of more than 40
activities and includes about 10 process roles. It is going to be used by all nineteen
Polish Ministries and central offices, with about 12000 users registered. About 200
documents are processed daily. On average, preparation of the Polish standpoint
takes about two or three days.

Owing to BPQL, it was possible to generalise this process and make it suitable for 
all the offices. Complex rules to assign appropriate workflow participants have been 
quite easily expressed in BPQL, especially for (1) the main coordinator who assigns
Polish subjects to individual EU documents, (2) leading and supporting coordinators
who assign experts to the processed document, (3) the leading expert and supporting
experts. These workflow participants are selected on the basis of possessed roles, their
competence, current work-load and availability. Competence data are extracted from 
the system ontology. 

In addition, the process owners got a better chance to modify the process
definition without modifying the code of the system. So far, there have been twelve
medium-scale changes of the process that were done only by modification of the process
definition (early production phase).



6 Related Work 

So far, there are a few other approaches to define a business process query language 
and use it in process definition. Firstly, in [7] the authors presented a language to 
model web applications - WebML. This language provides functionality to define 
business processes and make them more flexible. To define a condition or a workflow 
participant it is possible to use an attribute whose value may be calculated by a
complex program. Despite its huge flexibility (every algorithm may be written as a
procedure), this approach seems to be less appropriate for non-programming
process owners, and it hides the algorithms that calculate these attributes inside the
program code. In addition, to the best of our knowledge this language is not compliant
with any of the existing well-known standard process definition languages.

In [14] the authors proposed WPAL - a functional language to define workflow
participant assignment. This language is able to use workflow control data (e.g. the
performer of a given activity, a reference to the object that represents the start
activity). Despite its flexibility, it has problems similar to those of WebML. BPQL
defined in this article may be treated as a continuation and significant extension of WPAL.

Finally, there is some work on a business process query language carried out by
BPMI. It is promised that this language will offer facilities for process execution and
process deployment (process repository). However, after two years of this work, there
is still no official version available, not even a draft [3].




Fig. 3. Work list for an EWDP user (in Polish) 
On the other hand, there are many workflow management systems that provide
some selected elements of business process query languages. For example, in
Staffware [16] there is a set of workflow functions (i.e. SW functions) that can be
used to define conditions and a workflow participant. In Lotus Workflow [8], it is
possible to call a LotusScript function in order to define the above elements.
OfficeObjects WorkFlow in its previous version implemented WPAL [14]. Unfortunately, all
these examples of query languages do not provide clear and coherent semantics. In
addition, the problem of moving algorithms from the process definition into the
application still remains.



7 Conclusion 

In this paper we have defined a workflow process metamodel and a business process 
query language BPQL to operate on this metamodel. In order to assure clear syntax 
and complete semantics of the language, SBQL has been used as the core
specification. On top of it BPQL was defined. The article also shows how BPQL may be
used to make process definition more flexible and easy to modify. BPQL, following
SBQL, can work on all data models that can be used for a workflow process
metamodel, starting from the relational one, through XML-oriented, up to the most
advanced object-oriented models.

Despite these advantages the presented approach leaves several open issues. In 
some cases where relevant data come from several data sources it is impossible to 
make a BPQL query. Also BPQL is not helpful when the current control of process 
resources depends on factors that are not present in the metamodel, or factors that are 
too random. In such cases the workflow processes must be controlled manually. 

The first version of BPQL has been developed in OfficeObjects® WorkFlow and
implemented in the system for electronic document exchange between Poland and
European Council to define the process for answering European documents. Owing to
BPQL it was possible to reduce the flow complexity of the process and make it easier
to modify further.
Appendix: The (Simplified) Syntax of BPQL



<query>          := <literal>
                  | <name>
                  | exists <query> ( <query> )
                  | all <query> ( <query> )
                  | <query> . <query>
                  | <query> where <condition>
                  | <query> join <query>
                  | <query> as <aliasName>
                  | <query> group as <aliasName>
                  | <function>
                  | ( <queryList> )
                  | if <query> then <query> else <query>
                  | <algExpression>
<queryList>      := <query> {, <query>}
<condition>      := <logExpression>
<logExpression>  := <logSum>
<logSum>         := <logProduct> { or <logProduct> }
<logProduct>     := <logSExpression> { and <logSExpression> }
<logSExpression> := not <logSum> | ( <logSum> ) | <logCondition>
<logCondition>   := <leftSide> <opComp> <rightSide>
<opComp>         := < | <= | = | => | > | <>
<leftSide>       := <algExpression>
<rightSide>      := <algExpression>
<algExpression>  := <algSum>
<algSum>         := <algProduct> { [ + | - ] <algProduct> }
<algProduct>     := <algExpression> { [ * | / | % ] <algExpression> }
<algExpression>  := ( <algExpression> ) | <symbol> | <literal> | ( <query> )
<function>       := <fName> | <fName> ( <queryList> )
<symbol>         := <objName> {. <objName>}
<literal>        := <text> | <integer> | <float> | <boolean>
<boolean>        := true | false
Abstract. Workflow management systems support and automate the
enactment of business processes. For this purpose, workflow management 
systems use process definitions that have been manually planned and 
modeled at build time. Recent research approaches try to enhance this 
concept by automating the creation of process definitions, using planning 
algorithms. This avoids the need for predefined process definitions and 
thus increases flexibility and saves costs. An important aspect of
flexibility is the ability to react to unanticipated events that might occur
during runtime. This reaction can imply replanning and dynamically
adapting the process. This paper shows how replanning can be triggered 
automatically in an integrated workflow planning and enactment system. 
Triggers for monitoring process executions are presented and events are 
defined which lead to the evaluation of corresponding conditions for de- 
ciding when replanning is necessary. Finally, advantages, limitations and 
areas of application of this approach are discussed. 



1 Introduction 

In today's dynamic markets, organizations have to take advantage of information
technology to improve their business and stay competitive. A crucial point for
the competitiveness of organizations is the performance of their processes [7,22].
Workflow management systems [6,9,10] are applied to support and automate
process enactment. For this purpose, process definitions are used that have been
manually planned and modeled at build time. Recent research approaches propose
planning process definitions automatically to improve quality
and flexibility [1,11,12,18,25]. An important aspect of flexibility is the ability to
react to unanticipated events that might occur during runtime, by replanning
and dynamically adapting the process. This paper shows how replanning can be
triggered automatically in an integrated workflow planning and enactment system.
Triggers for monitoring process executions are presented. To this end, events
are defined which lead to the evaluation of corresponding conditions for deciding
when replanning is necessary.

A process is a defined set of partially ordered steps intended to reach a
goal [4]. It is the means to change a given situation in order to fulfill a company's
goal. The information about this situation including all relevant documents is
called a case. Examples of cases are an incoming purchase order that has to be
handled or a sick patient who has to be cured. From an organizational point
of view, the life cycle of a process includes the phases planning and enactment.
Planning is the generation of a description, called a process definition, of what has
to be done in which order to reach a particular goal. The subsequent phase is
enactment, which is the organizational task to schedule the work in accordance with
the process definition. Workflow management systems take a process definition
as input and use it to support and automate process enactment. Traditionally,
planning and supplying the process definition to the workflow management system
has to be done manually. Recently, advances in the automation of business
process planning have been made. Automated planning makes it possible to generate
individual process definitions for every case. Thus, quality and flexibility can be
improved. An important aspect of flexibility is the ability to react to unanticipated
events that might occur during runtime, by generating a new process
definition and dynamically adapting the process. Basically, there are three main
steps in replanning:

1. Trigger replanning when necessary 

2. Generate a new process definition 

3. Adapt process enactment to the changed process definition 

To fully automate replanning, all three steps have to be taken into account. While
research in workflow management concentrates on step 3 to enhance flexibility,
the integration of automated process planning makes it possible to fully support
replanning.

This paper is based on the integrated planning and enactment system pre- 
sented in [18] and adds a concept for automated initiation of replanning. Triggers 
are specified to monitor process executions and start replanning if necessary. In 
principle, replanning is necessary, if the process definition assigned to a case is 
no longer adequate for further enactment. This paper defines two degrees of ad- 
equateness and identifies events that may threaten this adequateness. To each 
event a condition is assigned, specifying the exact circumstances under which 
the event makes replanning necessary. 

Related work is presented in Section 2. Section 3 introduces the relevant 
foundations on AI planning algorithms and workflow management. Section 4 de- 
scribes the overall workflow planning and enactment system and adds a concept 
for automatically triggering replanning. Finally, in Section 5 areas of application, 
and unsolved problems are discussed. 



2 Related Work 

There are two major areas of related work: research on dynamic adaption of 
workflows and approaches to automatically generate workflow process defini- 
tions. 

Dynamic adaption of workflows deals with adjusting running process in- 
stances to changes in the process definition [3,15,17,8]. Nevertheless, all this 
work is concentrated on step 3 of replanning - the adaption of process
enactment - as described in Section 1. Triggering replanning and generating a new
process definition is expected to be done manually. 

Recent research approaches present options to automate the generation of
process definitions. For instance, in [1] ontologies of domain services and domain
integration knowledge are used that serve as a model for workflow integration
rules. DYflow [25] avoids the use of predefined process definitions and
allows the dynamic composition of Web services to business processes by applying
backward-chain, forward-chain and data flow inference. Another approach
dealing with automated composition of Web services is described in [14].
Semantics for a subset of DAML-S [21] are defined in terms of a first-order logical
language, to enable automated planning. In [11] the composition of single tasks
or subgraphs of workflows is embedded in a case-based framework for workflow
model management. In [12] the application of contingent planners to existing
workflow management systems is discussed. To generate process definitions
some of these approaches use Artificial Intelligence (AI) planning algorithms,
while others use proprietary algorithms. Although AI planning algorithms
have been in research for more than 30 years [5], advances in recent years
make their application on real business domains promising [13,23]. The issue of
replanning is discussed by some of these approaches, but step 1 of replanning -
triggering replanning - is not explicitly taken into account.

3 Preliminaries 

A lot of research has already been done on planning and workflow enactment
in the areas of AI planning algorithms and workflow management systems
respectively. This section presents the concepts from both research areas that
are important for an integration of workflow planning and enactment. In the
remainder of this paper, common concepts of both areas are described using one
consistent terminology, instead of applying the area-specific terminology in each
case. The example introduced in this section will be used throughout the paper
to illustrate the integration into an overall system.

3.1 Planning Algorithms 

In this section, foundations on AI planning are presented which are relevant for 
triggering replanning in an integrated workflow planning and enactment system. 
Furthermore, an example is introduced that is used throughout the paper. 

Planning algorithms [5,24] take a description of the current state of a case, a
goal, and a set of activity definitions as input. The goal constitutes the desired
state of the case and, where required, a metric to optimize. The means to transform
the case from its current state to the desired state are the activities. An
activity is a piece of work that forms one logical step within a process. An
activity definition defines the conditions under which an activity can be executed,
called preconditions, and its impacts on the state of the case, called effects:
Definition. An activity definition d consists of

- a set prec_d of preconditions

- a set eff_d of effects

A domain is a set of activity definitions. To sum up, the input of a planner is
a planning problem defined as follows:

Definition. A planning problem P consists of

- a domain dom_P

- an initial state init_P

- a goal goal_P

The initial state and the goal are logical descriptions of the state of the case. 
Planning is the task of finding a partially ordered set of activities that, when 
executed, transforms the case from its current state to the goal. The description 
of this set of activities and their ordering - the process definition - is the output of 
the planning algorithm. To illustrate the basic procedure of a planning algorithm 
and the relationship between its input and output, the planning of a simple 
process is presented. The example domain D consists of six activity definitions
d1 to d6 that are described using the Planning Domain Definition Language [16]
(PDDL). The corresponding definition is given in extracts:

(define (domain D)
  (:action d1
    :precondition (and (v) (t))
    :effect (and (y) (increase (costs) 50)))
  (:action d2
    :precondition (x)
    :effect (and (z) (u) (increase (costs) 20)))
  (:action d3
    :precondition (x)
    :effect (and (v) (t) (not (z)) (increase (costs) 30)))
  (:action d4
    :precondition (z)
    :effect (and (w) (increase (costs) 10)))
  (:action d5
    :precondition (x)
    :effect (and (w) (increase (costs) 40)))
  (:action d6
    :precondition (x)
    :effect (and (t) (increase (costs) 10)))
)
Fig. 1. Process Definitions p1 and p2



Activity definition d1 has the precondition v and the effect y. For instance,
v is the fact that a customer name is determined and y is the fact that the
address of the customer is determined. Thus, activity definition d1 determines
the address of a customer based on his name. Besides these functional properties,
non-functional properties are specified: The costs for the execution of d1 are 50.
To define a concrete planning problem, not only the domain has to be specified,
but also the initial state and the goal. The corresponding PDDL definition is given
in extracts:

(define (problem P)
  (:domain D)
  (:init (x) (= (costs) 0))
  (:goal (and (w) (y)))
  (:metric minimize (costs))
)

The planning problem P is specified as follows: The initial state is x ∧ (costs = 0)
and the goal is w ∧ y with the costs to minimize. To complete the input for a
planning algorithm, two extra activity definitions are introduced: an activity
definition Start without preconditions and whose effect is the initial state of the
case, and an activity definition Finish with the goal as a precondition and no
effects. The task of a planning algorithm is to find a process definition that solves
P. Process definition p1 depicted in Fig. 1 is a solution for P. The activities are
partially ordered in a way that the effects of preceding activities satisfy the
preconditions of subsequent activities. For example, d2 has the effect z that is
needed by d4 as a precondition. Such a link between the effect of one activity
that satisfies the precondition of another activity is expressed by a causal link,
depicted as a dashed arrow. Every causal link implies an ordering constraint.
For example, the ordering constraint from d3 to d1 is derived from the causal
link between these activities. Additionally, ordering constraints are also inserted
to protect causal links. For example, the ordering constraint from d3 to d2 is
inserted to protect the causal link from d2 to d4. If there were no ordering
constraint from d3 to d2, d3 could be executed between d2 and d4. This would
be problematic, because d3 negates the effect z of d2 that is needed by d4 as a
precondition.

Process definition p1 is a solution of P. The minimized overall costs are 110.
There exists another process definition p2 depicted in Fig. 1 that also produces
the facts w and y, if started in an initial state in which fact x holds. Since the
overall costs of p2 are not minimal, it is not a solution of P.
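
The cost figures above can be checked with a small simulation. The following
Python sketch is not part of the original system: it transcribes domain D from
the PDDL extract, treats a negated fact as absent (closed-world assumption), and
executes one linearization of a process definition. The names Action, DOMAIN and
run are illustrative, and the composition of p2 (d3, d1, d5) is an assumption,
since Fig. 1 is not reproduced here.

```python
from typing import NamedTuple


class Action(NamedTuple):
    pre: frozenset      # facts that must hold before execution
    add: frozenset      # facts made true by the activity
    delete: frozenset   # facts made false by the activity
    cost: int           # increase of (costs)


# Transcription of domain D from the PDDL extract.
DOMAIN = {
    "d1": Action(frozenset("vt"), frozenset("y"), frozenset(), 50),
    "d2": Action(frozenset("x"), frozenset("zu"), frozenset(), 20),
    "d3": Action(frozenset("x"), frozenset("vt"), frozenset("z"), 30),
    "d4": Action(frozenset("z"), frozenset("w"), frozenset(), 10),
    "d5": Action(frozenset("x"), frozenset("w"), frozenset(), 40),
    "d6": Action(frozenset("x"), frozenset("t"), frozenset(), 10),
}


def run(plan, init):
    """Execute the activities in order.

    Returns (final state, total costs), or None if a precondition fails.
    """
    state, costs = set(init), 0
    for name in plan:
        action = DOMAIN[name]
        if not action.pre <= state:
            return None
        state = (state - action.delete) | action.add
        costs += action.cost
    return state, costs


p1 = ["d3", "d1", "d2", "d4"]     # one linearization of p1
p2 = ["d3", "d1", "d5"]           # assumed composition of p2
unprotected = ["d2", "d3", "d4"]  # d3 between d2 and d4 destroys z
```

Under these assumptions run(p1, {"x"}) reaches w ∧ y at costs 110, the assumed
p2 reaches the goal at costs 120, and the unprotected ordering fails because d4
no longer finds its precondition z.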



3.2 Workflow Management 

The purpose of workflow management systems [6,9,10,19] is to support and au- 
tomate the enactment of business processes. During workflow enactment pieces 
of work have to be passed to the right participant at the right time with the 
support of the right tool. A central property of a workflow management system is 
that this functionality is not hard coded for a specific process, but implemented 
in a way that the system can take any process definition as input. For a workflow 
management system, the relationship between effects and preconditions as the 
causal origin of the ordering of the activities is not relevant. Therefore, a process 
definition as input for a workflow management system contains only ordering 
constraints and no information on preconditions, effects, or causal links. Next to
the process definition, the workflow management system needs information on 
the potential participants of the process and the available tools. 

Before the execution of a process, all activities are in the state init. The
execution of a process starts with the activities in the process definition that have
no predecessors. For example, in process definition p1 depicted in Fig. 1, activity
d3 is scheduled first. When an activity is executed, the workflow management
system automatically starts the appropriate tool with the data that has to be
processed in this activity. By starting an activity, its state changes to running.
After execution completes, the state of an activity changes to done and its
subsequent activities are started. This iterates until all activities are in the state
done and the process is completed.
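
The enactment cycle described above can be sketched in a few lines of Python.
This is a simplified illustration, not the behavior of any particular workflow
management system: the function name enact is hypothetical, activities are
started in batches, and each batch is assumed to finish before the next one
starts. The ordering constraints used in the example are those of p1 as
described in Section 3.1.

```python
def enact(activities, constraints):
    """Return the batches of activities in the order in which they are started.

    constraints is a set of pairs (predecessor, successor).
    """
    state = {a: "init" for a in activities}
    preds = {a: {p for p, s in constraints if s == a} for a in activities}
    batches = []
    while any(s != "done" for s in state.values()):
        # An activity may start once all of its predecessors are done.
        ready = sorted(
            a for a in activities
            if state[a] == "init" and all(state[p] == "done" for p in preds[a])
        )
        if not ready:
            raise ValueError("ordering constraints contain a cycle")
        for a in ready:
            state[a] = "running"  # the appropriate tool would be started here
        batches.append(ready)
        for a in ready:
            state[a] = "done"     # simplification: the whole batch completes
    return batches


p1_constraints = {("d3", "d1"), ("d3", "d2"), ("d2", "d4")}
schedule = enact(["d1", "d2", "d3", "d4"], p1_constraints)
```

For p1 the sketch starts d3 first, then d1 and d2, and finally d4.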



4 Replanning in an Integrated Planning and Enactment 
System 

This section is divided into three subsections: In Section 4.1, the overall frame- 
work for an integrated workflow planning and enactment system is presented. 
Section 4.2 carries on by adding a concept for replanning. Finally, Section 4.3 
describes triggers for replanning in detail.

[Fig. 2 (label: Organizational Model)]




4.1 Framework 

The two central sub-systems of the framework are the planner and the coordi- 
nator. The planner generates a process definition for each case. The coordinator 
then uses the process definition to schedule the activities. In the following, the 
basic functional and behavioral aspects of the interaction of these components 
are described. Fig. 2 gives an overview of the interrelation of input and output 
between planner and coordinator. The basis for the overall system is the formal 
description of the domain, the participants and the tools. They describe the 
means of the company to process individual cases. A case is defined as follows: 

Definition. A case c consists of

- a state state_c

- a goal goal_c

To plan a process definition to handle a case c, a corresponding problem definition
P has to be specified. This is done by mapping state_c to init_P and goal_c
to goal_P. The domain of the problem definition is the domain of the
organizational model. The planner takes the problem definition as input and generates
a process definition as output. A process definition is defined as follows:

Definition. A process definition p consists of

- a set inst_p of activity instances

- a set con_p ⊆ inst_p × inst_p of control connectors

Definition. An activity instance i based on activity definition d consists of

- an execution state exec_i ∈ {init, running, done}

- a set act_i of actual effects

- a set rel_i ⊆ eff_d of relevant effects
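
As an illustration only, the three definitions can be written down as Python
data classes. The class and field names below are assumptions chosen to mirror
the symbols state_c, goal_c, inst_p, con_p, exec_i, act_i and rel_i; they are
not taken from an existing implementation.

```python
from dataclasses import dataclass, field

EXEC_STATES = ("init", "running", "done")


@dataclass
class ActivityInstance:
    definition: str                          # name of the activity definition d
    exec_state: str = "init"                 # exec_i
    act: set = field(default_factory=set)    # act_i, actual effects
    rel: set = field(default_factory=set)    # rel_i, subset of eff_d

    def __post_init__(self):
        if self.exec_state not in EXEC_STATES:
            raise ValueError(f"unknown execution state: {self.exec_state}")


@dataclass
class ProcessDefinition:
    inst: dict   # inst_p: instance name -> ActivityInstance
    con: set     # con_p: control connectors as (predecessor, successor)

    def is_well_formed(self):
        """con_p must be a subset of inst_p x inst_p."""
        return all(a in self.inst and b in self.inst for a, b in self.con)


@dataclass
class Case:
    state: set   # state_c
    goal: set    # goal_c


# Process definition p1 and the case of the running example.
p1 = ProcessDefinition(
    inst={f"i{k}": ActivityInstance(f"d{k}") for k in (1, 2, 3, 4)},
    con={("i3", "i1"), ("i3", "i2"), ("i2", "i4")},
)
case = Case(state={"x"}, goal={"w", "y"})
```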

An activity instance is the representation of an activity definition in a process
definition. In addition to the information in its corresponding activity definition,
an activity instance contains information on its execution state and its effects.
This information is important for triggering replanning, as explained in
Section 4.3. The limitation to three different execution states of activity instances
is a strong simplification for the purpose of this paper. The set act_i contains the
effects the activity actually had during execution. The set rel_i contains the effects
of eff_d that are relevant for the process. An effect e ∈ eff_d is relevant iff, ceteris
paribus, the planner would generate another process definition if e ∉ eff_d. The
relevance of an effect in eff_d depends on the problem definition. Consider the
example introduced in Section 3.1. The activity definition d2 has the effects z ∧
u ∧ increase costs by 20. The effect u is not relevant for p1, because the same
process definition would be generated, no matter if d2 has the effect u or not.
Thus, the relevant effects rel_2 are z ∧ increase costs by 20.
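
The definition of relevance can be tried out directly on the example domain. The
sketch below is a brute-force illustration rather than a real planner: it
enumerates all sequences in which each activity definition is used at most once,
keeps the cheapest ones that reach the goal, and calls a propositional effect
relevant if removing it changes this set. The function names are hypothetical,
negated facts are modeled as absent facts, and the cost effect is not tested for
relevance here.

```python
from itertools import permutations

# (preconditions, added facts, deleted facts, costs), transcribed from domain D.
DOMAIN = {
    "d1": ({"v", "t"}, {"y"}, set(), 50),
    "d2": ({"x"}, {"z", "u"}, set(), 20),
    "d3": ({"x"}, {"v", "t"}, {"z"}, 30),
    "d4": ({"z"}, {"w"}, set(), 10),
    "d5": ({"x"}, {"w"}, set(), 40),
    "d6": ({"x"}, {"t"}, set(), 10),
}


def optimal_plans(domain, init, goal):
    """Return (minimal costs, set of all cheapest activity sequences)."""
    best, plans = None, set()
    for k in range(len(domain) + 1):
        for seq in permutations(domain, k):
            state, costs = set(init), 0
            for name in seq:
                pre, add, delete, c = domain[name]
                if not pre <= state:
                    break              # sequence is not executable
                state = (state - delete) | add
                costs += c
            else:
                if goal <= state:
                    if best is None or costs < best:
                        best, plans = costs, {seq}
                    elif costs == best:
                        plans.add(seq)
    return best, plans


def is_relevant(domain, init, goal, definition, effect):
    """Plan again without the effect and compare with the original result."""
    pre, add, delete, c = domain[definition]
    reduced = dict(domain)
    reduced[definition] = (pre, add - {effect}, delete, c)
    return optimal_plans(domain, init, goal) != optimal_plans(reduced, init, goal)
```

With init = {"x"} and goal = {"w", "y"}, the effect u of d2 comes out as not
relevant, while the effect z of d2 is relevant, which matches the discussion of
rel_2 above.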

After the process definition is planned, it is assigned to the case. The process 
definition combined with information on participants and tools, is the input of 
the coordinator. This allows the coordinator to schedule the execution of the 
activities in accordance with the process definition until the goal is reached. The
dashed arrow from the coordinator to the state of the case depicted in Fig. 2 
illustrates the effects of the activities on the case, which are scheduled by the 
coordinator. 

To illustrate the behavior of the overall system, the example from Section 3.1
is carried on. Let us assume a case c should be handled that is specified as follows:
state_c is x and goal_c is w ∧ y with the costs to be minimized. First, a corresponding
problem definition P has to be specified. Therefore, state_c is mapped to init_P
and goal_c is mapped to goal_P. A solution for P is the process definition p1
depicted in Fig. 1. The rectangles stand for the activity instances i1 to i4 that
represent the activity definitions d1 to d4 in p1. All activity instances are in
the state init. p1 is given to the coordinator. Execution begins by starting i3.
While running, i3 has the anticipated effects v ∧ t ∧ ¬z. After i3 completes,
the subsequent activity instances i1 and i2 are started. The execution continues
until all activity instances are in the state done and the goal is reached.

4.2 Replanning 

An integrated planning and enactment system allows process definitions to be
replanned automatically if necessary and the process enactment to be adapted.
Therefore, the interaction between planner and coordinator becomes more
interlaced. Consider the example above. Let us assume that i3 completes without
having all anticipated effects. For example, it has the effects v and ¬z but not the
effect t. In this case replanning becomes necessary, because a precondition of the
subsequent activity instance i1 is not satisfied. The coordinator, which monitors
the execution of the process, notes the missing effect and triggers replanning.
Replanning is done by planning a new process definition based on the current
state of the case. In the example, the current state of the case state_c is
x ∧ v ∧ ¬z. A new planning problem P' is defined and the planner generates a
process definition p3 shown in Fig. 3. i3 is not part of p3, because the effect v is
not needed any more, and the missing effect t can also be produced by i6 at a
lower cost.

Fig. 3. Process Definition p3

p3 is assigned to the case and given to the coordinator, which then adapts the
process enactment to the new process definition. In the example above,
replanning became necessary because an activity instance completed and an
anticipated effect was missing. Next to anticipated effects that are missing, other
events can occur that make replanning necessary. In the following, a general
approach to trigger replanning automatically is presented.
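
The replanning step of the example can be reproduced with a small cost-optimal
search. The following Python sketch uses a plain uniform-cost search over states
as a stand-in for a real planner; the function name plan is hypothetical, the
domain is transcribed from the PDDL extract, and ¬z is modeled by the absence of
z. It returns one cheapest sequence of activity definitions, not a partially
ordered process definition.

```python
import heapq

# (preconditions, added facts, deleted facts, costs), transcribed from domain D.
DOMAIN = {
    "d1": ({"v", "t"}, {"y"}, set(), 50),
    "d2": ({"x"}, {"z", "u"}, set(), 20),
    "d3": ({"x"}, {"v", "t"}, {"z"}, 30),
    "d4": ({"z"}, {"w"}, set(), 10),
    "d5": ({"x"}, {"w"}, set(), 40),
    "d6": ({"x"}, {"t"}, set(), 10),
}


def plan(state, goal):
    """Uniform-cost search; returns (costs, sequence) or None if no plan exists."""
    counter = 0                                   # tie-breaker for the heap
    queue = [(0, counter, frozenset(state), ())]
    seen = set()
    while queue:
        costs, _, current, seq = heapq.heappop(queue)
        if goal <= current:
            return costs, seq
        if current in seen:
            continue
        seen.add(current)
        for name, (pre, add, delete, c) in sorted(DOMAIN.items()):
            if pre <= current:
                counter += 1
                successor = frozenset((current - delete) | add)
                heapq.heappush(queue, (costs + c, counter, successor, seq + (name,)))
    return None


initial = plan({"x"}, {"w", "y"})         # planning for the original case
replanned = plan({"x", "v"}, {"w", "y"})  # state after i3 missed the effect t
```

Under these assumptions the initial plan consists of d1, d2, d3 and d4, whereas
the replanned one replaces d3 by the cheaper d6, in line with the description of
p3.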

As a basic principle, replanning is necessary if the process definition that is
assigned to a case becomes inadequate. While initially the planner generates an
optimal process definition, the example above shows that certain circumstances
can make it inadequate for further enactment. To state the adequacy of a
process definition more precisely, the properties working and optimal are defined:

Definition. A process definition p is working for case c iff, applied to c, it
transforms state_c to goal_c.

Definition. A process definition p is optimal for case c iff it is working for c and
no other process definition working for c is better in relation to the optimization
metric.

For example, initially the process definition p1 is optimal for a case c. Process
definition p2 is also working for c, but it is not optimal. It has to be considered
that the properties working and optimal also refer to process definitions that
have already been started: the state of the case may already have changed and
the execution state of the activity instances may already be running or
done. In this case the properties working and optimal refer to the unfinished
part of the process definition and the current state of the case. For example,
let us assume that the process definition p1 is already partially executed: i3 has
completed and has all anticipated effects. p1 is still optimal, because it transforms
the current state state_c: x ∧ v ∧ t ∧ ¬z to the goal state goal_c: y ∧ w. If i3
completes without having the anticipated effect t, p1 is not working for the case
c any more. As will be shown, process definitions can also lose the property
optimal without losing the property working.

The purpose of replanning is that a case is always assigned to an optimal
process definition. It is assumed that a planner initially generates an optimal
process definition for a case. The process definition is then assigned to a case
for execution. During execution, the process definition may lose the property
optimal. In this case, replanning has to be triggered to generate a new, optimal
process definition. If a new process definition is found, process enactment is
adapted. If the planner is unable to find a new process definition, the goal cannot
be reached any more and human intervention becomes necessary.



4.3 Triggers for Replanning 

In this subsection a detailed description of triggers for replanning is given. The
presented set of triggers does not claim to be complete, but it covers typical
causes for replanning. The triggers are specified in the form of Event Condition
Action (ECA) rules [2]. Events are identified that threaten the property optimal
of the associated process definition. To avoid unnecessary replanning, conditions
are defined for every event to specify the exact circumstances under which the
event may threaten the property optimal. The triggers in detail:

Trigger 1:

Event: activity instance i has effect e
Condition: e ∉ eff_d with activity instance i based on activity definition d
Action: Replan

In the domain, effects are defined for each activity that are anticipated from its
execution. For example, the effect y is anticipated from the execution of i1. The
planner uses this information to plan an optimal process definition for a case.
During execution, an activity may have effects that are not anticipated. In this
case, the input of the planner was incorrect. Thus, it can no longer be guaranteed
that the process definition is optimal. For example, if an activity instance has the
unanticipated effect v, p1 is still working for c, but it is not optimal any more.
p1 is working, because if the remaining activities are scheduled in accordance with
p1, the goal will be reached. Nevertheless, p1 is not optimal, because p3 is also
working for c and has lower costs. In contrast to this example, activities can also
have effects threatening the property working. The trigger to handle unanticipated
effects is defined as follows: the event for Trigger 1 is an activity instance i having
an effect e. The condition of Trigger 1 checks whether the effect was anticipated.
The domain defines the anticipated effects for each activity. Therefore, it is
checked whether the effect e was defined in the activity definition d of the activity
instance i. If this is the case, the input of the planner was correct and it is
guaranteed that the process definition is still optimal. Otherwise, the process
definition may have lost this property and replanning is necessary.
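
As a minimal sketch, the condition of Trigger 1 is a membership test against
eff_d. The dictionary EFF below transcribes the propositional effects of domain D
(the string "not z" stands for ¬z); the function name trigger1 is hypothetical.

```python
# Anticipated propositional effects eff_d of each activity definition d.
EFF = {
    "d1": {"y"},
    "d2": {"z", "u"},
    "d3": {"v", "t", "not z"},
    "d4": {"w"},
    "d5": {"w"},
    "d6": {"t"},
}


def trigger1(definition, effect, eff=EFF):
    """Event: an instance based on `definition` has `effect`.

    Returns True if the condition e ∉ eff_d holds, i.e. replanning is needed.
    """
    return effect not in eff[definition]
```

For example, an effect w observed for an instance based on d3 fires the trigger,
whereas the anticipated effect v does not.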

Trigger 2:

Event: activity instance i completes
Condition: rel_i − act_i ≠ ∅
Action: Replan

Next to unanticipated effects, an activity can have fewer effects than anticipated.
This corresponds to the example given at the beginning of Section 4.2, where i3
is missing the effect t. This can make the process definition lose the property
working. In the example above, p1 loses the property working, because a
precondition of the subsequent activity instance i1 is not satisfied. As an activity
can have effects as long as it is running, the proper time to check if it had all
anticipated effects is when it completes. Therefore, the event for Trigger 2 is an
activity instance i that completes. If eff_d − act_i ≠ ∅, there is at least one effect
e ∈ eff_d defined in the activity definition that did not occur during the execution
of the activity instance. Thus, the input of the planner was not correct and it cannot
be guaranteed that the process definition is still optimal. Nevertheless, not all
effects in the activity definition are always relevant for planning a concrete
process definition, as explained in Section 4.1. Given the set rel_i of relevant effects
for each activity instance, unnecessary replanning can be avoided by defining
rel_i − act_i ≠ ∅ as the condition for Trigger 2. In the example above, the relevant
effects rel_2 for i2 are z ∧ increase costs by 20. If i2 completes and the effect
u is missing, there is no need to replan. In contrast, if the effect z is missing,
replanning is necessary. This example shows that additional information on
relevant effects for each activity instance makes it possible to avoid unnecessary
replanning. For this reason the planner adds this information to each activity
instance when generating the process definition. Please note that a check
act_i − eff_d ≠ ∅ is not necessary, because unanticipated effects have already been
taken into account by Trigger 1.
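
The difference between checking eff_d and checking rel_i can be illustrated with
plain set operations. In this sketch the function name trigger2 and the string
"costs+20", which stands for the effect increase costs by 20, are assumptions
made for the example.

```python
def trigger2(rel, act):
    """Condition of Trigger 2, evaluated when activity instance i completes.

    Returns True if rel_i − act_i is not empty, i.e. replanning is needed.
    """
    return len(rel - act) > 0


eff_d2 = {"z", "u", "costs+20"}   # all effects anticipated for d2
rel_2 = {"z", "costs+20"}         # effects of d2 that are relevant for p1

act_without_u = {"z", "costs+20"}  # i2 completed, effect u is missing
act_without_z = {"u", "costs+20"}  # i2 completed, effect z is missing
```

With the relevant effects as the reference set, the missing effect u does not
fire the trigger, while the missing effect z does. Using eff_d2 instead of rel_2
would request replanning in both cases.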

Trigger 3: 

Event: goal_c of case c changes
Condition: true
Action: Replan

The event for Trigger 3 is a change in the goal of a case. As a result the associated
process definition may lose the property optimal. For example, consider the
execution of p1. Let us assume that i3 is in the state running. Due to an external
event, e.g. a customer request, the goal state goal_c is changed from w ∧ y to y.
As a result, p1 is not optimal any more, because i2 and i4 have become needless.
Furthermore, if goal_c is changed to m, p1 even loses the property working.
From this it follows that a change in the goal of a case must trigger replanning
to ensure an optimal process definition.

Trigger 4: 

Event: external effect on state state_c of case c
Condition: true
Action: Replan

Trigger 4 handles external effects on the state of the case. External effects are
ad hoc effects that cannot be assigned to the execution of an activity instance.
For example, p1 is executed and i3 is running. The coordinator has assigned
i3 to a participant for execution. Now external events may occur that have an
effect w on the case, for example: a customer calls, pieces of information are
lost, or a new law is implemented in the company. Due to the effect w, i2 and
i4 become needless. Thus, p1 is not optimal any more. Generally speaking, an
external effect is always an unanticipated effect and thus replanning becomes
necessary.

Trigger 5: 

Event: activity definition d changes
Condition: ∃ an activity instance i based on d with exec_i ∈ {init, running}
Action: Replan

Triggers 5 and 6 deal with changes in the domain. The event for Trigger
5 is a change in an activity definition d. The preconditions or effects of d may
change or d becomes unavailable. This event makes replanning necessary if the
process definition assigned to the case contains at least one activity instance
based on d that is not already done. Changes in activity definitions of finished
activity instances do not threaten the properties working and optimal, because
the domain only specifies the anticipated behavior of activity instances.
Thus, activity instances that are already finished are not affected by domain
changes.
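
The condition of Trigger 5 is an existence check over the activity instances of
the assigned process definition. A minimal sketch, assuming that each instance
is represented by a pair of definition name and execution state; the function
name trigger5 and the chosen execution states are illustrative.

```python
def trigger5(changed_definition, instances):
    """Event: activity definition `changed_definition` changes.

    instances maps an instance name to (definition name, execution state).
    Returns True if an instance based on the changed definition is not done yet.
    """
    return any(
        definition == changed_definition and state in ("init", "running")
        for definition, state in instances.values()
    )


# p1 after i3 has completed and i1 and i2 have been started.
p1_instances = {
    "i3": ("d3", "done"),
    "i1": ("d1", "running"),
    "i2": ("d2", "running"),
    "i4": ("d4", "init"),
}
```

In this situation a change of d3 does not fire the trigger, a change of d4 does,
and a change of d5, which is not used in p1, is ignored by Trigger 5.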

Trigger 6: 

Event: additional activity definition becomes available 
Condition: true 
Action: Replan 

If additional activities become available, the current process definition may not
be optimal any more. For example, p1 is executed and i3 is running. Consider
a new activity definition d7 that is equal to d5 with the exception that it costs
only 10. Thus, another process definition p4 that replaces i2 and i4 is better
in terms of costs. As new activities can make the current process definition
lose the property optimal, replanning should be triggered. Please note that, in
contrast to Trigger 5, an additional activity definition only threatens the property
optimal and not the property working. Thus, if it is sufficient to have a working
process definition, Trigger 6 can be omitted to save replanning effort.

5 Discussion 

In this paper we specified triggers that enable automated initiation of
replanning. For this purpose, events were identified that threaten the adequacy of
process definitions. Conditions were assigned to every event to specify the exact
circumstances under which the event makes replanning necessary. In particular,
a set of relevant effects rel_i for an activity instance i was defined and it was
shown how it can help to avoid unnecessary replanning. It is important to mention
that relevant effects are not part of the standard output of an AI planner. It
is possible to check if an effect is relevant by planning again without this effect
and comparing the resulting process definition with the original; that is the
way relevance was defined. This approach may not be feasible, because for every
effect of every activity instance a process definition has to be planned to check
relevance. Instead, relevant effects can be determined directly during planning.
Basically, the information on causal links and causal link protection generated
during planning has to be analyzed. A detailed description is beyond the scope
of this paper. The presented approach is a step towards fully automating
replanning. In particular, applications in which replanning is time-critical or has to
be done without human interaction can benefit from the ability to automatically
trigger replanning. The presented triggers are part of the replanning concept of
an integrated planning and enactment system called Plaengine. Currently a first
prototype is realized that is able to take a domain and problem definition in
PDDL as input and to generate a process definition in the Business Process
Execution Language for Web Services [20]. The process definition is then given to
the coordinator for process enactment. Automated triggering of replanning is
included in the second prototype, which is currently in the design phase.

Acknowledgments. The authors would like to thank Jens Bundling for his

Abstract. Implementation of the data stream processing applications
requires a method for formal specification of the computations at a 
dataflow level. The logical models of stream processing hide the lower 
level implementation details. To solve this problem, we propose a new 
model of data stream processing based on the concepts of relational data 
stream, extensible system of elementary operations on relational streams, 
and data stream processing network integrating the dataflows and ele- 
mentary operations. Next, we present the transformations of grouped 
data stream processing applications into data stream processing net- 
works. The transformations proposed in the paper integrate the networks 
and optimize the implementations through elimination of the redundant 
elementary operations and dataflows. Finally, the paper introduces a 
timestamp based synchronization of data flows in our model and dis- 
cusses its correctness. 



1 Introduction 

Data streams naturally occur in many modern applications of information tech- 
nologies [1]. Processing of data streams includes storing, manipulating, and fil- 
tering the theoretically unlimited sequences of data items propagated by the 
sensor devices [2], financial institutions, traffic monitoring systems, etc. At the 
logical level, a traditional SQL based specification of a data stream application 
makes it very similar to a traditional database application [3,4]. Unfortunately, 
for performance reasons, the implementation techniques developed for
conventional database systems cannot be directly applied to the processing
of rapidly changing sequences of data items [5]. Specification of the computations
performed on the elements of data streams requires a different system of elemen- 
tary operations and different organization of data flows among the operations. 

This work introduces the concept of a relational data stream and defines a new
system of dataflow level operations on relational streams. The new system
allows for the transformation of grouped data stream applications, i.e. different
applications that operate on common data streams, into data stream processing
networks and for the optimization of the networks. The data flow operations
reflect the principles of reactivity, adaptability, and extensibility of data stream
processing. Reactivity means the processing of a new data element as soon as it
is appended to a stream. Adaptability means automatic reaction of the system to
dynamically changing situations, for example, shifting more computational
power into the processing of data streams whose frequencies suddenly increased,
or changing the processing priorities in order to eliminate bottlenecks.
Extensibility means the ability of the system to create complex operations
from the elementary ones through pipelining, symmetric composition, or
encapsulation of the elementary operations. To enforce the principles listed above we
extend the traditional models of data streams with the concepts of composite
data items (tuples) combined with insert or delete operations. A template
for the elementary operations, rather than a fixed set of operations, assures the
extensibility of the system. The elementary operations concurrently processing
the data items against the varying contents of relational tables provide a high
level of reactivity of the system. Adaptability of the system can be achieved
through the identification of equivalent transformations of the common designs
into a number of data stream processing networks.

The paper is organised as follows. Section 2 contains a brief review of the 
origins and recent contributions in the area of data stream processing systems. 
Section 3 presents a formal model of relational data streams and extensible sys- 
tem of operations on the streams. Section 4 introduces the concepts of paths, 
data stream processing networks and proposes the transformations of the rela- 
tional algebra based data stream applications into the data stream processing 
networks. Optimization of the grouped applications is discussed in the same 
section. Section 5 considers synchronization of elementary operations and data 
flows between the operations. Finally, Section 6 concludes the paper.



2 Previous Works 

The origins of data stream processing techniques can be traced back to adaptive
query processing [6,7,8,9,10], continuous query processing [11,12,4,13], online
algorithms [14,15,16], and a large number of works on data flow processing, see
the reviews [17,18]. A comprehensive review of many works that contributed to 
the various aspects of data stream processing can be found in [1]. 

A typical data stream processing model applies pipes and/or queues to
connect the operations on the elements of data streams in such a way that the
outputs of one operation become the inputs of another operation [19]. The STREAM
project [20] expresses the query execution plans as expressions over the
unary and binary operations directly derived from the relational algebra. The
operations are linked with queues and have access to the synopsis data
structures that implement the most up-to-date views on data streams. Execution of
the operations is controlled by a central scheduler that dynamically determines
the amount of time available to each one of them. The CACQ [19] and TelegraphCQ
[5] systems treat the data streams as infinite relational tables. These systems
use the "Eddy" [21] operator to dynamically process the items appended to
data streams. The Aurora and Medusa projects [22] target the implementation of
a high-performance stream processing engine. Optimization of stream processing
described in [23] is based on a model of computations where the relational
algebra-like operations on data streams are linked with queues. Approximate
computation of relational join over the data streams and the complexity of set
expression processing are presented in [24] and [25].



3 Data Stream Processing Model 

3.1 Basics 

A raw data stream is an unlimited sequence of elementary values, usually
numbers. A prepared data stream is an unlimited sequence of composite data
elements like records, tuples, objects, etc. A prepared data stream is obtained by
"zipping" a number of raw data streams. A relational data stream is a prepared
data stream such that all its elements are pairs <a,t> where a ∈ {insert,
delete} and t is a tuple of elementary data items. When a pair <insert,t> is
collected from a stream, the tuple t obtains the most up-to-date timestamp and
it is recorded in a fixed size window over the stream. If the window is full, then
the tuple t' with the oldest timestamp is removed from the window and the pair
<delete,t'> is inserted into an output stream of the recording operation. The
empty slot in the window is filled with t and the pair <insert,t> is inserted into an
output stream of the recording operation. If a pair <delete,t> is collected from
a stream by the recording operation, then the tuple t is removed from the window
and the pair is inserted into an output stream. The recording operation is always
the first operation executed after a new element is collected from a stream.
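
The behavior of the recording operation can be sketched as follows. This is an
illustrative Python sketch, not the implementation of the model: the class name
Recorder is hypothetical, arrival order stands in for timestamps, and the output
stream is represented by the list of pairs returned for each processed element.

```python
from collections import deque


class Recorder:
    """Fixed size window over a relational data stream."""

    def __init__(self, size):
        self.size = size
        self.window = deque()   # oldest tuple at the left end

    def process(self, element):
        """Consume one pair (a, t) and return the pairs sent to the output."""
        a, t = element
        out = []
        if a == "insert":
            if len(self.window) == self.size:
                oldest = self.window.popleft()    # tuple t' with oldest timestamp
                out.append(("delete", oldest))
            self.window.append(t)                 # fill the empty slot with t
            out.append(("insert", t))
        elif a == "delete":
            if t in self.window:
                self.window.remove(t)
            out.append(("delete", t))
        else:
            raise ValueError(f"unknown operation: {a}")
        return out


recorder = Recorder(size=2)
trace = [recorder.process(("insert", k)) for k in (1, 2, 3)]
```

With a window of size 2, the third insertion first emits <delete,1> for the
oldest tuple and then <insert,3>.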

The semantics of the data stream applications are expressed at the logical
level in terms of windows and operations on prepared data streams. Let
w_s1, ..., w_sn be the windows on data streams s1, ..., sn. Consider an application
a expressed as a relational algebra expression e_a(w_s1, ..., w_sn). Let t1, ..., tk
denote the time spots when the new data elements are collected from the input
streams. Then, the evaluation of the expression at t1, ..., tk provides a sequence of
values e_a(t1), ..., e_a(tk). These values are the results of the continuous processing
of application a at t1, ..., tk. Obviously, the recomputation of the entire expression
e_a(w_s1, ..., w_sn) at t1, ..., tk is not feasible. We need operations that describe
the minimal sequence of actions required to recompute an application after the
arrival of a new data item.



3.2 Elementary Operations 

A system of elementary operations on data streams includes the housekeeping
operations, like the recorder described in the previous section and the injector
used to merge the data streams. The other group includes the operations similar to
the relational algebra operations and derived from an operation template. An
injector operation (δ → s) takes on input an element δ from a data stream
and inserts it into a processing sequence on another data stream s. A recorder
operation (δ → w_s) acts either on a window w_s on an input stream or on a
set of tuples that plays the role of an intermediate data container. When the
recorder processes <insert,t> on a set r, then t becomes a member of r and
<insert,t> is appended to the output stream of the recorder. Processing of
<delete,t> removes t from r and sends <delete,t> to the output stream. The
operations derivable from the template take on input a single element δ = <a,t>,
a ∈ {insert, delete} from a stream s_i, optionally the contents of window w_j
on a data stream s_j, and send the results to n output streams s_out1, ..., s_outn,
see Figure 1. The template is a set of pairs (φ_1, θ_1), ..., (φ_n, θ_n) where for
i = 1, ..., n, φ_i is a filtering expression and θ_i is a transformation expression. A
filtering expression is a formula that evaluates to either true or false over a
tuple t and the static contents of window w_j on stream s_j. A transformation
expression is a sequence of elementary transformation operations
like projection, aggregation, arithmetic transformations, etc. on a tuple t and
optionally the contents of w_j. The semantics of an operation derived from the
template is equivalent to the semantics of the following pseudocode.

if φ_1(t, w_j) then θ_1(t, w_j) → s_out1;
if φ_2(t, w_j) then θ_2(t, w_j) → s_out2;
...
if φ_n(t, w_j) then θ_n(t, w_j) → s_outn

The elementary operations are derived from the template by the instantiation
of expressions φ_i and θ_i and the selection of an input stream s_i, window w_j, and
the output streams. For example, an operation α_spj that implements a concatenation
of selection (σ_φ), projection (π_x), and natural join (⋈) of a tuple t and window
w_j, with projection of the results, is derived from the template by the
substitutions φ_1 ← σ_φ(t) and θ_1 ← π_x(t ⋈ w_j) and discarding the rest of the
template. The semantics of α_spj is defined by the pseudocode:



if σ_φ(t) then π_x(t ⋈ w_j) → s_out1

In another example, the derivation of an operation α_avg that selects from a
window w_j all data items that have the same values of attribute a as a tuple t
and computes the average of all values of attribute b from the selected items
provides:

if true then avg_b(γ(t ⋈ w_j)) → s_out1

A transformation operation γ creates an aggregated data item from the results of
t ⋈ w_j, and the function avg_b finds the average value of attribute b in the
aggregated data items.

An operation −left that takes on input an element <a,t> and computes
{t} −_X W_j is derived as follows.

if ∀r ∈ W_j t[X] ≠ r[X] then
    if a = insert then <insert,t> → S_out else <delete,t> → S_out endif
endif

The terms <insert,t> → S_out and <delete,t> → S_out denote the insertions of
the elements <insert,t> and <delete,t> into an output stream of the operation.
A similar operation −right takes an element <a,t> and computes W_j −_X {t} in
the following way.

if ∃r ∈ W_j r[X] = t[X] then
    if a = insert then <delete,t> → S_out else <insert,t> → S_out endif
endif
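
The two difference operations can be sketched in Python as follows. This is a minimal sketch that follows the pseudocode line by line; the function names minus_left and minus_right, the helper project, and the list-based window and output stream are assumptions introduced for illustration.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]
Element = Tuple[str, Row]  # (action, tuple); action is "insert" or "delete"


def project(row: Row, attrs: Sequence[str]) -> Tuple[Any, ...]:
    """Return t[X], the values of row on the attributes X."""
    return tuple(row[a] for a in attrs)


def minus_left(elem: Element, window: List[Row],
               attrs: Sequence[str], out: List[Element]) -> None:
    """{t} -_X W_j: forward the element unchanged if no window row agrees with t on X."""
    action, t = elem
    if all(project(t, attrs) != project(r, attrs) for r in window):
        out.append((action, t))


def minus_right(elem: Element, window: List[Row],
                attrs: Sequence[str], out: List[Element]) -> None:
    """W_j -_X {t}: if some window row agrees with t on X, emit the inverse action."""
    action, t = elem
    if any(project(r, attrs) == project(t, attrs) for r in window):
        out.append(("delete" if action == "insert" else "insert", t))


window = [{"x": 1, "y": "p"}, {"x": 2, "y": "q"}]
out_left: List[Element] = []
out_right: List[Element] = []
minus_left(("insert", {"x": 3}), window, ["x"], out_left)    # no match: forwarded
minus_left(("insert", {"x": 1}), window, ["x"], out_left)    # match: suppressed
minus_right(("insert", {"x": 1}), window, ["x"], out_right)  # match: inverted
```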

The implementations of −left and −right are different because the set
difference operation is not commutative. Commutative operations need only one
variant of the respective operation. For example, an operation α_spj has the same
implementation for both (t ⋈ W_j) and (W_j ⋈ t). It is easy to show that all
relational-algebra-like operations on a stream element and a window can be derived
from the template. Moreover, the template allows for the derivation of
composite "piped" operations, i.e. operations where the data items flow from one
stage of processing to another without being recorded in the persistent storage.
The piped operations significantly improve the performance of data stream
processing. The next section shows how to implement the relational algebra operations
on the data streams as networks of the elementary operations derivable from
the template.

3.3 Paths and Networks 

A data stream processing path, or just a path, is an expression p : t_1 | ... | t_n where
p is a path name and for i = 1, ..., n, t_i denotes either an elementary operation or
an injection point. The adjacent elementary operations in a path are "piped" such
that outputs generated by t_i are consumed by its successor t_{i+1}. If an operation
t_i contributes to more than one output stream then the identification of the output
used by a path is attached to t_i. An injection point is a location between two
Fig. 3. A network implementation of (r(ab) ⋈_b s(bc)) ⋈_ac t(ac)



adjacent operations in the same path where an injection operation inserts new
elements from another stream. The injection points are needed to merge two
or more paths. A set of paths describes a system of operations performed on
the relational streams. For example, the system of operations given in Figure 2 is
described by the paths:

p_1 : α_1 | α_2(out_2) | α_4 | x | α_5,

p_2 : α_2(out_1) | α_3 | →x.

All operations in a path are binary operations. An operation is a pair αw
or wα where α is an operation code and w is a data container read by the
operation, e.g. (w_i ⋈_b) or (−left W_j). An operation code may be prefixed with a
tag to uniquely identify the operation in another path. The symbol '|' means that
elements produced on output by one operation become the input elements of
its successor. The example given in Figure 3 and the collection of paths given below
implement the logical level expression (r(ab) ⋈_b s(bc)) ⋈_ac t(ac) where r, s, and
t are the relational streams. A window W_out contains the results of the expression.

p_r : →W_r | ⋈_b W_s | ⋈_ac W_t | →W_out

p_s : →W_s | W_r ⋈_b | ⋈_ac W_t | →W_out

p_t : →W_t | ⋈_b W_s | ⋈_ac W_r | →W_out
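
One way to picture a piped path is as a chain of Python generators, where each stage consumes the elements produced by its predecessor. The sketch below is an assumption-laden illustration rather than the system described in the paper: the names record, join and run_path are hypothetical, containers are plain lists of dictionaries, and only insertions are modelled.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]
Stage = Callable[[Iterable[Row]], Iterator[Row]]


def record(container: List[Row]) -> Stage:
    """Recorder operation (->W): store each element in a container and pass it on."""
    def stage(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            container.append(row)
            yield row
    return stage


def join(container: List[Row], attrs: List[str]) -> Stage:
    """Join each incoming element with the current contents of a container."""
    def stage(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            for other in list(container):
                if all(row[a] == other[a] for a in attrs):
                    yield {**other, **row}
    return stage


def run_path(stages: List[Stage], rows: Iterable[Row]) -> None:
    """Pipe the stages together (the '|' symbol) and drain the resulting stream."""
    stream: Iterable[Row] = rows
    for stage in stages:
        stream = stage(stream)
    for _ in stream:
        pass


# A path of the shape ->W_r | join_b W_s | ->W_out over two elements of stream r.
w_r: List[Row] = []
w_s: List[Row] = [{"b": 1, "c": 7}]
w_out: List[Row] = []
run_path([record(w_r), join(w_s, ["b"]), record(w_out)],
         [{"a": 5, "b": 1}, {"a": 6, "b": 2}])
```

Because generators are lazy, each element travels through the whole path before the next one is read, so no intermediate result is stored between stages.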
4 Applications

4.1 Single Expressions 

Implementing a data stream application requires a transformation of a logical
level specification of the application into its implementation in the form of a data
stream processing network. Let e(r_1, ..., r_n) be a relational algebra expression
over the relational streams r_1, ..., r_n. An ad-hoc translation of e into an equivalent
data stream processing network is performed in the following way. Consider a
stream r_i ∈ {r_1, ..., r_n}. Consider an operation α_i(r_i, e') where e' is a subexpression
of e. Construct a path p_i : α_i W_e' where W_e' is a set container with the intermediate
results of subexpression e'. Consider an operation α_{i+1}(α_i(r_i, e'), r_j) whose
arguments include the result of operation α_i and an argument r_j. Extend the path
constructed in the previous step to p_i : α_i W_e' | α_{i+1} r_j. Consider the next operation
and extend the path once more. Repeat the process until no more operations are left
in e. Then, append an operation →W_result to the path p_i. Repeat this process for
all input streams r_1, ..., r_n. If W_e' is a set container with the intermediate results
then insert an operation →W_e' into all paths that contribute to the contents of W_e'.
Add a recorder operation →W_i at the beginning of all paths p_i whose inputs are
taken directly from the data streams. Application of the method described above to
the logical level expression (r(ab) ⋈_b s(bc)) ⋈_ac t(ac) provides the following paths:

p_r : →W_r | ⋈_b W_s | →W_rs | ⋈_ac W_t | →W_result,

p_s : →W_s | W_r ⋈_b | →W_rs | ⋈_ac W_t | →W_result,

p_t : →W_t | ⋈_ac W_rs | →W_result.

A straightforward transformation of a relational algebra expression into a set
of paths always assumes that intermediate results are recorded in the persistent
storage. Access to and maintenance of the persistent storage decrease the
performance and increase the complexity of synchronisation of dataflows. The problem
can be avoided if it is possible to transform the syntax tree of the expression into
a left- (right-) deep syntax tree in which each of its arguments r_i is located at the
left- (right-) deep leaf level. Let α(r_i, β(r_j, e'(r_1, ..., r_{j-1}))) be a subexpression of
the expression. If the operations α and β commute then we transform the expression
into β(r_j, α(r_i, e'(r_1, ..., r_{j-1}))). The transformation brings an argument
closer to the left- (right-) deep leaf level. As a simple example consider the expression
(r(ab) ⋈_b s(bc)) ⋈_ac t(ac) over the relational streams r, s, and t. The expression
is equivalent to (t(ac) ⋈_c s(bc)) ⋈_ab r(ab) where the argument t(ac) is located
at the left-deep leaf level. In such a case, construction of a path p_t from the new
expression provides p'_t : →W_t | ⋈_c s(bc) | ⋈_ab r(ab), which does not access the
intermediate results, see Figure 3. The dots represent the recorder operations, the
trapezoids represent the data containers, and edges from containers to elementary
operations represent read actions performed by the operations when processing
the stream elements. The edges between the elementary operations represent the
flows of relational data streams. The elementary operations are implemented as
simultaneously running processes that collect the stream elements from the input
queue, read data containers, and append the results to an output queue.
4.2 Grouped Expressions

Implementation of the grouped applications requires an optimal transformation
of several logical level expressions into one data stream processing network. It
is essential that operations and sub-paths common to several paths are
implemented as the same processes. The main objective of the optimal transformations
of grouped expressions is to minimize the total number of elementary operations
involved in the implementation. As a simple example consider two data stream
applications represented at a logical level as the relational algebra expressions
e_1 : r(ab) ⋈_a s(ac) and e_2 : t(cd) ⋈_d s(ac) where r, s, and t are the relational data
streams. Translation of the expressions e_1 and e_2 provides the following collection
of data stream processing paths:

p_r1 : →W_r | ⋈_a W_s | →W_out1        p_t2 : →W_t | ⋈_d W_s | →W_out2

p_s1 : →W_s | ⋈_a W_r | →W_out1        p_s2 : →W_s | ⋈_d W_t | →W_out2

To reduce the number of processes, the join operations ⋈_a W_s and ⋈_d W_s can be
implemented as a single join operation ⋈_{a|d} W_s that merges the functionality of
both operations. The new operation recognizes the identifiers of elements from
streams r and t and outputs the results to out_r or to out_t respectively. Moreover,
recording of input elements of a stream s can be expressed as a single recorder operation
followed by an operation split that replicates the stream s on the outputs out_s1 and out_s2 for
processing against r and t. The data processing paths after the optimizations
use only 3 join operations (see below).

p_r1 : →W_r | x | ⋈_{a|d} W_s(out_r) | →W_out1
p_t2 : →W_t | →x
p'_t2 : ⋈_{a|d} W_s(out_t) | →W_out2
p_s : →W_s | split(out_s1) | ⋈_d W_t | →W_out2
p'_s : split(out_s2) | ⋈_a W_r | →W_out1

Identification of the common operations and sub-paths reduces the total number
of operations in the implementation. Consider two data stream processing
paths p_α : α_1 w_1 | ... | α_m w_m and p_β : β_1 v_1 | ... | β_n v_n. We say that a sequence
of operations τ_α in p_α matches a sequence of operations τ_β in p_β if the sequence
of data containers used in τ_α is the same as the sequence of data containers used
in τ_β. The operations τ_α and τ_β can be merged if they match each other and if
the paths they belong to have been already merged over the sequences … and
… such that … precedes τ_α and … follows τ_α, or the opposite. For example,
the paths p_α1 : α_1 w_1 | α_2 w_2 and p_α2 : α_2 w_2 | α_1 w_1 cannot be merged on both
operations α_1 and α_2. If such a case happens we try to commute the sequence
of operations in the paths whenever it is possible.

Integration of the operations αw and βw provides an operation α|βw that
recognizes the stream identifiers in the data elements and computes either αw or
βw. Of course the implementation of α|βw shares the common code of the
operations. A typical example is a hash-based implementation in which
hashing of w over the attributes in x allows for a common implementation
of both operations.
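
A merged operation of this kind can be sketched in Python as one hash index over the container w that serves elements from several streams and dispatches results by stream identifier. The class name SharedJoin, its methods, and the dictionary of output lists are hypothetical names chosen for this sketch; the paper does not prescribe them.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List

Row = Dict[str, Any]


class SharedJoin:
    """A single join process over container w, shared by several input streams."""

    def __init__(self, attr: str) -> None:
        self.attr = attr
        # The container w hashed over the join attribute; built once, probed by all streams.
        self.index: DefaultDict[Any, List[Row]] = defaultdict(list)

    def record(self, row: Row) -> None:
        """Add a row to the shared container."""
        self.index[row[self.attr]].append(row)

    def process(self, stream_id: str, row: Row,
                outputs: Dict[str, List[Row]]) -> None:
        """Probe the shared index and send matches to the output of the given stream."""
        for match in self.index.get(row[self.attr], []):
            outputs[stream_id].append({**match, **row})


shared = SharedJoin("c")
shared.record({"a": 1, "c": 10})
outputs: Dict[str, List[Row]] = {"r": [], "t": []}
shared.process("r", {"c": 10, "b": 2}, outputs)  # element of stream r: one match
shared.process("t", {"c": 11, "d": 3}, outputs)  # element of stream t: no match
```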
5 Synchronization of Data Flows

Serial processing of the data stream elements is a simple but not very
effective way of performing the computations on data streams. High frequencies
of the input streams and isolation of elementary operations in a data stream
processing network suggest simultaneous execution of the operations. The
idea of parallel or concurrent computations immediately raises questions
about the scheduling of the elementary operations and the correctness of the
scheduling. Assume that a data stream processing network implements a function f
that processes a number of relational streams. Then, the network correctly
implements the function f if for any sequence of input elements δ_1, ..., δ_n the
network produces a sequence of values f(δ_1, S_1), ..., f(δ_n, S_n) where S_1, ..., S_n
are the states of all data containers in the network after the arrivals of elements
δ_1, ..., δ_n. In many cases such a strict definition of correctness is not really needed.
The high frequencies of the streams cause such frequent modifications of the
outputs that observation of all modifications in real time is simply impossible.
Therefore we introduce a concept of partial correctness. We say that the
computations performed by a data stream processing network are partially correct
if for any sequence of input elements δ_1, ..., δ_n some of the results are equal to
f(δ_1, S_1), ..., f(δ_n, S_n). Partial correctness requires only that from time to time
the results produced by a network reflect the present state of the input data streams.

A sequence of computations along a path may be considered as a database
transaction that operates on data containers. Then, at a logical level, synchronization
of the computations along paths is identical to synchronization of database
transactions. To be correct, the executions along all paths must preserve
order serializability for all of their input elements. This means that synchronization
should preserve the order of conflicting operations, i.e. read and write operations
over the data containers. Consider a data stream processing network that has
n input relational streams s_1, ..., s_n and that has no containers with
intermediate results of the computations. In such a network the conflicts
happen over the accesses to the windows on relational streams W_s1, ..., W_sn and
to the final results in W_out. The input elements are recorded in the windows
on relational streams and at the same moment the other operations attempt to
read the contents of the windows. An operation that processes an element δ_si and
reads a window W_sj must read from W_sj only the elements that arrived before
δ_si. It is easy to enforce this condition by examining the timestamp of δ_si and
the timestamps of data items read from W_sj and processing only the data items
whose timestamps are lower than the timestamp of δ_si. The conflicts
over access to the final result are resolved with timestamps as well. The
results are recorded in an intermediate container and reordered according to
the values of timestamps in the final container.
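
The timestamp rule for networks without intermediate containers can be illustrated with a short Python sketch. The names visible_items and reorder_results and the (timestamp, payload) representation are assumptions made for this example only.

```python
from typing import Any, List, Tuple

Item = Tuple[int, Any]  # (timestamp, payload)


def visible_items(window: List[Item], element_ts: int) -> List[Item]:
    """Return only the window items recorded before the element being processed,
    i.e. those whose timestamp is lower than the element's timestamp."""
    return [item for item in window if item[0] < element_ts]


def reorder_results(intermediate: List[Item]) -> List[Item]:
    """Move results from the intermediate container to the final container
    in timestamp order."""
    return sorted(intermediate, key=lambda item: item[0])
```

For example, an operation processing an element with timestamp 5 against a window holding items stamped 1, 5 and 9 would read only the item stamped 1, no matter how far the concurrent recorder has advanced.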

It is possible to use timestamping to synchronize the computations in a class
of stream processing networks that write to and read from intermediate data
containers and in which the operations on data containers satisfy an invariant
property. Let a data stream processing network compute a function f. We
say that the network is invariant if for any permutation of a finite sequence of
elements δ_1, ..., δ_n the final result produced by the network is always the same.
A network processing the relational streams is an invariant network because
processing of relational algebra operations does not depend on the order of
elements, e.g. a join of two relational tables does not depend on the order of rows
in the tables. To correctly synchronize the operations in an invariant data
stream processing network we have to dynamically modify the timestamps of
the elements flowing through the network. Each time an element is recorded
in an intermediate data container it obtains a new current timestamp, which is
the largest timestamp in the network. It is possible to show that this method
preserves the partial correctness of the computations performed by the network.
A sketch of a proof is as follows. Consider an invariant network such that the
operations →w and αw belong to two separate paths in the network. The recorder
operation →w writes to an intermediate data container w and αw reads from the
same container. Assume that elements δ_m and δ_n are submitted for processing
and recorded in the windows on the input streams such that the timestamp of δ_m is
lower than the timestamp of δ_n. Let res(δ_m) denote the results of processing δ_m
recorded in w and let res(δ_n) denote the results of processing δ_n that trigger
the execution of operation α. To correctly process the elements, res(δ_m) should be
recorded in w before res(δ_n) triggers the execution of operation α. In such a
case w is processed in the correct order because α will read all data items
inserted earlier into w by the recorder. However, if the recording of res(δ_m) in w is
late and operation α starts its execution too early then res(δ_n) will never be processed by
α against res(δ_m), because the timestamp of res(δ_m) is lower than the timestamp
of res(δ_n). A modification of the timestamp associated with res(δ_m) dynamically
changes the order of elements δ_m and δ_n processed by the remaining part of
the network. A problem is that one part of the network processes the elements in
the order δ_m before δ_n and another part of the network processes the elements in the
opposite order. If a network is invariant, i.e. the order of input elements has no
impact on the final result, then when the processing of both δ_m and δ_n is completed
the results should be correct. This method supports only partial correctness
because the results after processing δ_n alone are definitely incorrect. The
methods that achieve complete and partial correctness in non-invariant data
stream processing networks are beyond the scope of this paper.

6 Summary and Open Problems 

Modern applications of information technologies involve processing of fast
evolving and theoretically unlimited sequences of data items commonly called
data streams. This work targets the grouped processing of relational algebra
expressions over relational streams. A relational stream is a sequence of
tuples obtained from zipping the raw data streams. We propose an extensible
system of elementary operations on relational streams and we describe the
transformations of data stream applications into data stream processing networks.
Next, we show how to merge the data stream processing paths in order
to minimize the total number of elementary operations in the implementation.
The work also discusses the synchronization of parallel computations in the data
stream processing networks. We introduce the notions of correctness and partial
correctness of parallel computations and show that timestamping is sufficient
to preserve the correctness of computations in the networks which do not use
intermediate data containers. The way in which the logical level expressions on data
streams are transformed into the data stream processing networks seems to be
the most important contribution of this research. The concept of operational
elements in a data stream, like insert and delete, leads towards processing of
structured data streams.

A number of problems remain to be solved. A central problem is the
implementation of a data stream processing system based on a set of elementary
operations derivable from the common pattern defined in the paper, together with the
performance aspects of the system. The other group of problems includes synchronization of
data stream processing in a general class of data stream processing networks and
identification of the classes of data stream applications that can be translated
into data stream processing networks that do not need access to the windows
on intermediate data streams.

References 

1. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues 
in data stream systems. In Popa, L., ed.: Proceedings of the Twenty-first ACM 
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM 
Press (2002) 1-16 

2. Madden, S., Franklin, M.J.: Fjording the stream: An architecture for queries over 
streaming sensor data. In: 18th International Conference on Data Engineering 
February 26-March 1, 2002, San Jose, California, IEEE (2002) 

3. Arasu, A., Babcock, B., Babu, S., McAlister, J., Widom, J.: Characterizing mem- 
ory requirements for queries over continuous data streams. In Popa, L., ed.: 
Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on 
Principles of Database Systems, ACM Press (2002) 221-232 

4. Babu, S., Widom, J.: Continuous queries over data streams. SIGMOD Record 30 
(2001) 109-120 

5. Krishnamurthy, S., Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, 
M.J., Hellerstein, J.M., Hong, W., Madden, S.R., Reiss, F., Shah, M.A.: Tele- 
graphcq: An architectural status report. Bulletin of the Technical Committee on 
Data Engineering 26 (2003) 11-18 

6. Hellerstein, J.M., Franklin, M.J., Chandrasekaran, S., Deshpande, A., Hildrum, 
K., Madden, S., Raman, V., Shah, M.A.: Adaptive query processing: Technology 
in evolution. Bulletin of the Technical Committee on Data Engineering 23 (2000) 
7-18 

7. Cole, R.L.: A decision theoretic cost model for dynamic plans. Bulletin of the 
Technical Committee on Data Engineering 23 (2000) 34-41 

8. Bouganim, L., Fabret, F., Mohan, C.: A dynamic query processing architecture for 
data integration systems. Bulletin of the Technical Committee on Data Engineering 
23 (2000) 42-48 

9. Ives, Z.G., Levy, A.Y., Weld, D.S., Florescu, D., Friedman, M.: Adaptive query 
processing for internet applications. Bulletin of the Technical Committee on Data 
Engineering 23 (2000) 19-26 






10. Urhan, T., Franklin, M.J.: Xjoin: A reactively-scheduled pipelined join operator.
In: IEEE Data Engineering Bulletin 23(2), IEEE (2000) 27-33

11. Terry, D., Goldberg, D., Nichols, D., Oki, B.: Continuous queries over append-only
databases. In: Proceedings of the 1992 ACM SIGMOD International Conference
on Management of Data. (1992) 321-330

12. Liu, L., Pu, C., Tang, W.: Continual queries for internet scale event-driven in- 
formation delivery. IEEE Transactions on Knowledge and Data Engineering 11 
(1999) 610-628 

13. Hellerstein, A.R.: Eddies: Continuously adaptive query processing. In: Proc. ACM- 
SIGMOD International Conference on Management of Data. (1998) 106-117 

14. Fiat, A., Woeginger, G.J.: On Line Algorithms, The State of the Art. Springer 
Verlag (1998) 

15. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: Proceedings 
of the 1997 ACM SIGMOD International Conference on Management of Data. 
SIGMOD Record (1997) 171-182 

16. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD 1997, 
Proceedings ACM SIGMOD International Conference on Management of Data, 
ACM Press (1997) 171-182 

17. Lee, E.A., Parks, T.M.: Dataflow process networks. Technical report. Department 
of Electrical Engineering and Computer Science, University of California (1995) 

18. Stephens, R.: A survey of stream processing. Technical Report CSRG95-05, De- 
partment of Electronic and Electrical Engineering, University of Surrey (1996) 

19. Madden, S., Shah, M., Hellerstein, J.M., Raman, V.: Continuously adaptive
continuous queries over streams. In: Proceedings of the 2002 ACM SIGMOD
International Conference on Management of Data, Madison, Wisconsin, June 4-6, 2002,
ACM Press (2002) 49-60

20. Group, T.S.: Stream: The Stanford stream data manager. Bulletin of the Technical 
Committee on Data Engineering 26 (2003) 19-26 

21. Avnur, R., Hellerstein, J.M.: Eddies: Continuously adaptive query processing. In: 
Proceedings of the 2000 ACM SIGMOD International Conference on Management 
of Data, ACM (2000) 261-272 

22. Stonebraker, M., Cherniack, M., Cetintemel, U., Balazinska, M., Balakrishnan, H.: 
The aurora and medusa projects. Bulletin of the Technical Committee on Data 
Engineering 26 (2003) 3-10 

23. Viglas, S.D., Naughton, J.F.: Rate-based query optimization for streaming informa- 
tion sources. In: Proceedings of the 2002 ACM SIGMOD International Conference 
on Management of Data, ACM Press (2002) 37-48 

24. Das, A., Gehrke, J., Riedewald, M.: Approximate join processing over data streams.
In: Proceedings of the 2003 ACM SIGMOD International Conference on
Management of Data, San Diego, June 9-12, 2003. (2003)

25. Ganguly, S., Garofalakis, M., Rastogi, R.: Processing set expressions over
continuous update streams. In: Proceedings of the 2003 ACM SIGMOD International
Conference on Management of Data, San Diego, June 9-12, 2003. (2003)

26. Getta, J., Vossough, E.: Optimization of data stream processing. Submitted for 
publication in SIGMOD Record (2004) 



Processing Sliding Window Join Aggregate in Continuous 
Queries over Data Streams 



Weiping Wang¹, Jianzhong Li¹,², DongDong Zhang¹, and Longjiang Guo¹,²



¹ School of Computer Science and Technology, Harbin Institute of Technology, China
² School of Computer Science and Technology, Heilongjiang University, China
{wpwang, lijzh, zddhit, guolongjiang}@hit.edu.cn



Abstract. Processing continuous queries over unbounded streams requires
unbounded memory. A common solution to this issue is to restrict the range of
continuous queries to a sliding window that contains the most recent data of the
data streams. Sliding window join aggregates are frequently used queries in data
stream applications. The processing method to date is to construct a streaming
binary operator tree and execute it in a pipeline. This method consumes a great deal of
memory in storing the sliding window join results, and is therefore not suitable for
stream query processing. To handle this issue, we present a set of novel sliding
window join aggregate operators and corresponding algorithms, which
are memory-saving and efficient. Because the performance
of the proposed algorithms varies with the states of the data streams, a scheduling
strategy is also investigated to maximize the processing efficiency. The algorithms
in this paper can process not only the complex sliding window join aggregate,
but also the multi-way sliding window join aggregate.



1 Introduction 

Recently a new class of data-intensive applications has become widely recognized:
applications in which the data is modeled best not as persistent relations but rather as
transient data streams. Examples of such applications include financial applications,
network monitoring, security, telecommunications data management, web
applications, manufacturing, and sensor networks. The database research community has
begun focusing its attention on query processing over unbounded continuous input
streams. Due to their continuous and dynamic nature, querying data streams involves
running a query continually over a period of time.

Unbounded streams cannot be wholly stored in bounded memory. A common
solution to this issue is to restrict the range of continuous queries to a sliding window
that contains the last K items or the items that have arrived in the last T time units.
The former is called a count-based, or sequence-based, sliding window, while the
latter is called a time-based, or timestamp-based, sliding window [12]. Constraining
all queries by sliding window predicates allows continuous queries over unbounded
streams to be executed in finite memory and in an incremental manner by producing
new results as new items arrive.
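
The two kinds of sliding window can be sketched in Python with collections.deque. This is a minimal illustration, not taken from the paper: the class names CountWindow and TimeWindow are hypothetical, and items are assumed to arrive in non-decreasing timestamp order.

```python
from collections import deque
from typing import Any, Deque, Tuple


class CountWindow:
    """Count-based window: keeps the last K items."""

    def __init__(self, k: int) -> None:
        self.items: Deque[Any] = deque(maxlen=k)  # oldest item is dropped automatically

    def insert(self, item: Any) -> None:
        self.items.append(item)


class TimeWindow:
    """Time-based window: keeps the items that arrived in the last T time units."""

    def __init__(self, span: int) -> None:
        self.span = span
        self.items: Deque[Tuple[int, Any]] = deque()  # (timestamp, item), oldest first

    def insert(self, ts: int, item: Any) -> None:
        self.items.append((ts, item))
        self.expire(ts)

    def expire(self, now: int) -> None:
        """Drop every item whose timestamp lies before now - T."""
        while self.items and self.items[0][0] < now - self.span:
            self.items.popleft()


cw = CountWindow(2)
for value in (1, 2, 3):
    cw.insert(value)

tw = TimeWindow(10)
tw.insert(0, "a")
tw.insert(5, "b")
tw.insert(12, "c")  # expires the item stamped 0, since 0 < 12 - 10
```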

Evaluating sliding window joins over streams is practical and useful in many
applications. In many cases it is also necessary to compute aggregates over the results
of a sliding window join. For example, in a building monitoring system, the sensors
monitoring temperature generate stream A, which contains three attributes: Location,
Time and Temperature. Meanwhile, the sensors monitoring smoke generate stream
B, which also contains three attributes: Location, Time and Strength. In order to find
out which room might be on fire, the manager of the building may pose a
query such as the following:

Q1:

SELECT A.location, COUNT(*) FROM A[10 MINUTE], B[10 MINUTE]
WHERE A.location = B.location
AND A.temperature >= 40
AND B.strength > 0.6
GROUP BY A.location
HAVING COUNT(*) > 5;

When Q1 is processed as a continuous query, a new count value must be output as
each new stream item arrives. A solution for processing the sliding window join
aggregate is illustrated in Figure 1(1): upon each arrival of a new item from stream A,
four tasks must be performed:

1. Insert the new item into the sliding window of stream A.

2. Invalidate all the expired items in A's sliding window and B's sliding window.

3. Process the join of A's sliding window and B's sliding window, and insert the
results into a queue.

4. Compute the aggregate function by scanning the queue in one pass.
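
The four tasks above can be sketched in Python as follows. The sketch is deliberately naive, mirroring the query plan of Figure 1(1); the function name on_arrival_a and the representation of window entries as (timestamp, location) pairs held in deques are assumptions made for this illustration, and the selection predicates of Q1 are omitted.

```python
from collections import deque
from typing import Deque, List, Tuple

Item = Tuple[int, str]  # (timestamp, location)


def on_arrival_a(item: Item, now: int, span: int,
                 win_a: Deque[Item], win_b: Deque[Item]) -> int:
    """Process one arrival from stream A and return the COUNT over the join."""
    # Task 1: insert the new item into A's sliding window.
    win_a.append(item)
    # Task 2: invalidate expired items in both windows.
    for win in (win_a, win_b):
        while win and win[0][0] < now - span:
            win.popleft()
    # Task 3: join the two windows and materialize the results in a queue.
    queue: List[Tuple[Item, Item]] = [
        (a, b) for a in win_a for b in win_b if a[1] == b[1]
    ]
    # Task 4: compute the aggregate in one pass over the queue.
    return sum(1 for _ in queue)


win_a: Deque[Item] = deque([(0, "r1")])
win_b: Deque[Item] = deque([(1, "r1"), (2, "r2"), (8, "r1")])
count = on_arrival_a((12, "r1"), now=12, span=10, win_a=win_a, win_b=win_b)
```

The queue built in task 3 is the memory cost that the operators proposed in the paper aim to avoid.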




Fig. 1. (1) SW Join Aggregate Query Plan, (2) SW Join Aggregate Operator



This method consumes a great deal of memory in storing the results of the sliding
window join. Assume that there are 1000 items each in sliding window A and sliding
window B, and that the join selectivity is 0.1; then 1000 × 1000 × 0.1 = 100,000 join
results need to be stored. Since memory is the most expensive resource in a data stream
query processing system, this is not a good solution. In contrast, we propose a set of
novel sliding window join aggregate operators, which compute the aggregate function
while processing the sliding window join. These operators need not store the join
results, and thereby save memory effectively. For each operator, we propose several
algorithms, which are memory-saving and efficient. Since the
performance of the algorithms varies with the states of the data streams, a scheduling strategy
is also presented, which dynamically assigns the most effective algorithm to the
operator according to the states of the streams. The methods presented in this paper can
process not only the complex sliding window join aggregate, but also the multi-way
sliding window join aggregate.

The remainder of this paper is organized as follows. In Section 2, we review
related work. Section 3 describes the algorithms for the two-way sliding window join
aggregate operator and gives an experimental study of the proposed algorithms. Section
4 presents techniques for maximizing sliding window join aggregate efficiency, while
Section 5 extends the proposed algorithms to process the multi-way sliding window join
aggregate. In Section 6, we give our conclusions and identify future work.



2 Related Work 

There has been a great deal of recent interest in developing novel data management
techniques and adapting traditional database technology to the data stream model,
such as Cougar [2], Aurora [3], Telegraph [4,5] and STREAM [1]. The first two focus
on processing sensor data. STREAM addresses all aspects of stream data
management, including memory management, operator scheduling, and approximate query
answering via summary information. A continuous query language (CQL) has also
been proposed within the STREAM project [7].

Defining sliding windows is one solution proposed in the literature for bounding
the memory requirements of continuous queries and unblocking streaming operators
[12]. Since it is impractical to process a join over unbounded streams in bounded
memory, a good deal of research has been conducted on processing sliding window
joins over data streams. The first non-blocking binary join algorithm, the symmetric hash
join, was presented in [11]; it was optimized for in-memory performance, which leads
to thrashing on larger inputs. Sliding window joins over two streams were studied
by Kang et al. [14], who introduced incrementally computable binary joins as well as
a per-unit-time cost model. Lukasz et al. discussed sliding window multi-join
processing over data streams, and also proposed join ordering heuristics
that provide a good join order without iterating over the entire search space [15].

Processing aggregates over data streams is another related research area. Manku et
al. presented algorithms for computing frequency counts exceeding a user-specified
threshold over data streams [8]. Correlated aggregates were studied by J. Gehrke in
[9]. A. Dobra et al. calculated small "sketch" summaries of the streams, and used them
to provide approximate answers to aggregate queries with provable guarantees on the
approximation error. Datar et al. proposed an algorithm for maintaining samples and
simple statistics over sliding windows [12].

To the best of our knowledge, no research work to date has discussed
the processing of sliding window join aggregates over data streams.



3 Processing Two-Way Sliding Window Join Aggregate 

There are five two-way sliding window join aggregate operators, one for each of the
aggregate functions, namely SWJ-COUNT, SWJ-SUM, SWJ-AVG, SWJ-MAX and
SWJ-MIN. Among them, SWJ-COUNT, SWJ-SUM and SWJ-AVG work in the same
way, and SWJ-MAX and SWJ-MIN work in the same way. In this paper, we only
present the algorithms for the operator SWJ-COUNT, which are also suitable
for SWJ-SUM and SWJ-AVG. Due to space limitations, the algorithms
for the operators SWJ-MAX and SWJ-MIN are not presented in this paper. Table 1
lists the meaning of the symbols used in this paper.



Table 1. Symbols

Symbol          Meaning
SWA, SWB        Sliding window corresponding to stream A, B
a, B            Number of items in SWA, SWB respectively
T_a             Time size of time-based window SWA
…               Arrival rate of stream A
M               Number of hash buckets in hash table of sliding window
C               Cost of accessing one item in sliding window
σ               Selectivity of join operator
sizeA, sizeB    Item size of stream A, B respectively
a∘b             Concatenation of item a and item b
⋈               Join


Before illustrating algorithms in detail, we introduce several definitions. 

Definition 1. A data stream is an infinite time sequence in increasing order,
$S = \{s_1, \ldots, s_t, \ldots\}$, where $s_t$ is the item appearing at time $t$.

Definition 2. Let $T$ be a time interval size and $t > T$ the variable time. $SW[t-T : t]$ is
defined as a time-based sliding window of data stream $S$, whose time size is $T$.

Definition 3. $SW_k$ is a snapshot of the sliding window at time $T_k$: $SW_k =
SW[T_k - T : T_k]$.



3.1 SWJ-COUNT Evaluating Algorithms

The SWJ-COUNT operator computes the number of sliding window
join results. Three algorithms are introduced: SC, IC and TC.

3.1.1 SC Algorithm 

SC is a simple algorithm for processing SWJ-COUNT. When a new item of stream A (or
B) arrives, it is first inserted into sliding window SWA (or SWB). Then all
expired items in SWA and SWB (those whose timestamps are
now outside the current time window) are removed. Finally, the join of SWA and SWB
is processed, and the aggregate value is computed while the join is processed.

Most of the time cost of SC lies in processing the join. Since the hash join is
typically the most efficient of the common join algorithms for equality joins, it is
chosen to process the join in SC, and the sliding windows are organized as hash tables.

Before analyzing the cost of SC algorithm, we briefly introduce our cost model. 
Each arrival of an item from streams triggers the sliding window join aggregate algo- 
rithm to execute once. In this paper, the time cost we consider is the time that the 
algorithm executes once. 

In the first step, SC inserts the new item into its sliding window, at a cost
of $C$. The second step scans both SWA and SWB in one pass, at a
cost of $(\alpha+\beta)C$. Assume that SWA and SWB have $M$ buckets each, and that each
bucket of SWA and SWB holds $\alpha/M$ and $\beta/M$ items on average, respectively. Then the cost
of processing $SWA \bowtie SWB$ is

$$M \cdot \frac{\alpha}{M} \cdot \frac{\beta}{M} \cdot C = \frac{\alpha\beta}{M} C.$$

Here we ignore the cost of computing the aggregate function, because it is done
while the join is processed. The total time cost of SC is

$$C + (\alpha+\beta)C + \frac{\alpha\beta}{M} C \approx \left(\alpha+\beta+\frac{\alpha\beta}{M}\right) C.$$

It is easy to see that the space cost of SC is $O(\alpha \cdot sizeA + \beta \cdot sizeB)$.
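
The three SC steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: the names `HashWindow` and `sc_count` are hypothetical, an item is reduced to a join key and a timestamp, and the hash table keeps one bucket per distinct join key instead of a fixed number $M$ of buckets.

```python
from collections import defaultdict


class HashWindow:
    """Time-based sliding window kept as a hash table (assumed layout:
    join key -> timestamps of the live items with that key)."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def insert(self, key, ts):
        self.buckets[key].append(ts)

    def expire(self, now, T):
        # Drop every item whose age exceeds the window size T.
        for key in list(self.buckets):
            live = [ts for ts in self.buckets[key] if now - ts <= T]
            if live:
                self.buckets[key] = live
            else:
                del self.buckets[key]


def sc_count(swa, swb, key, ts, T, from_a=True):
    """One execution of SC for an item arriving on stream A (or B)."""
    (swa if from_a else swb).insert(key, ts)   # step 1: insert the new item
    swa.expire(ts, T)                          # step 2: invalidate both windows
    swb.expire(ts, T)
    # Step 3: hash join; matches are counted bucket by bucket.
    return sum(len(items) * len(swb.buckets.get(k, ()))
               for k, items in swa.buckets.items())
```

Every call recomputes the whole join, which is the cost that the incremental algorithms below try to avoid.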

3.1.2 IC Algorithm 

Since all items in SWA must be joined with all items in SWB at each execution
of SC, the time cost of SC is high. Because only a
few items expire at each invalidation of the sliding windows, consecutive
executions of SC repeat much of the same computation. In this
section, we present a new algorithm, IC, which is more efficient than
SC.

The main idea of IC is illustrated in Figure 2. IC calculates the SWJ-COUNT
in an incremental manner; that is, the $(k+1)$-th result of SWJ-COUNT is computed
from the $k$-th aggregate value.




Fig. 2. IC Algorithm

Before discussing the details of the algorithm, we introduce a theorem.

Theorem 1. Given four relations A, B, C and D, the following equation holds:

$$|C \bowtie D| = |A \bowtie B| + |(C \setminus A) \bowtie D| + |(D \setminus B) \bowtie C| - |(C \setminus A) \bowtie (D \setminus B)| - |(A \setminus C) \bowtie (B \cap D)| - |(B \setminus D) \bowtie (A \cap C)| - |(A \setminus C) \bowtie (B \setminus D)|.$$

The proof of Theorem 1 can be found in Appendix A. Assume that the current
snapshots of SWA and SWB are $SWA_k$ and $SWB_k$, and that $|SWA_k \bowtie SWB_k|$ is known. When
a new item $f$ of stream A arrives at time $T_j$, the IC algorithm proceeds
as follows.

IC Algorithm

Input: SWA, SWB, $T_j$, $f$, CountValue, $T$

Output: CountValue

1. Insert $f$ into SWA;

2. Invalidate every expired tuple $g$ in SWA and SWB that satisfies $T_j -
g.timestamp > T$. Push the expired tuples of SWA and SWB into collections SetA
and SetB respectively;

3. val1 = COUNT($\{f\} \bowtie SWB$);

4. val2 = COUNT($SetA \bowtie SWB$);

5. val3 = COUNT($SetB \bowtie (SWA \setminus \{f\})$);

6. val4 = COUNT($SetA \bowtie SetB$);

7. CountValue = CountValue + val1 - val2 - val3 - val4;

8. Clear collections SetA and SetB;

9. Return CountValue;

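Here COUNT(A) is a function that computes the number of items in set A.
After the insertion of the new item $f$ and the invalidation of SWA and SWB, the current
snapshots of SWA and SWB are $SWA_{k+1}$ and $SWB_{k+1}$ respectively. We want to
compute $|SWA_{k+1} \bowtie SWB_{k+1}|$. According to Theorem 1, we get the following equation:

$$\begin{aligned}
|SWA_{k+1} \bowtie SWB_{k+1}| ={} & |SWA_k \bowtie SWB_k| + |(SWA_{k+1} \setminus SWA_k) \bowtie SWB_{k+1}| \\
& + |(SWB_{k+1} \setminus SWB_k) \bowtie SWA_{k+1}| - |(SWA_{k+1} \setminus SWA_k) \bowtie (SWB_{k+1} \setminus SWB_k)| \\
& - |(SWA_k \setminus SWA_{k+1}) \bowtie (SWB_k \cap SWB_{k+1})| - |(SWB_k \setminus SWB_{k+1}) \bowtie (SWA_k \cap SWA_{k+1})| \\
& - |(SWA_k \setminus SWA_{k+1}) \bowtie (SWB_k \setminus SWB_{k+1})|
\end{aligned}$$

According to the description of IC, $SWA_{k+1} \setminus SWA_k = \{f\}$.
Since no item of data stream B arrives, $SWB_{k+1} \subseteq SWB_k$,
namely $SWB_{k+1} \setminus SWB_k = \emptyset$. It is easy to see that $SWA_k \setminus SWA_{k+1} = SetA$,
$SWB_k \setminus SWB_{k+1} = SetB$, $SWA_{k+1} \cap SWA_k = SWA_{k+1} \setminus \{f\}$ and
$SWB_{k+1} \cap SWB_k = SWB_{k+1}$. Then:

$$|SWA_{k+1} \bowtie SWB_{k+1}| = |SWA_k \bowtie SWB_k| + |\{f\} \bowtie SWB_{k+1}| - |SetA \bowtie SWB_{k+1}| - |SetB \bowtie (SWA_{k+1} \setminus \{f\})| - |SetA \bowtie SetB|.$$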

Several join operations need to be processed in IC; here we again choose the
hash join to process them (except the join of SetA and SetB). In IC, the
cost of the first step is $C$ and the cost of the second step is $(\alpha+\beta)C$. SWB consists of
$M$ buckets, and each bucket holds $\beta/M$ items; consequently, the cost of the third step is
$(\beta/M)C$. The time costs of the fourth and the fifth steps are $|SetA| \cdot (\beta/M) \cdot C$ and
$|SetB| \cdot (\alpha/M) \cdot C$ respectively. We choose the nested loop join for processing the join
of SetA and SetB, so the cost of the sixth step is $|SetA| \cdot |SetB| \cdot C$. The total time cost of IC is:

$$\begin{aligned}
& C + (\alpha+\beta)C + \frac{\beta}{M}C + |SetA|\frac{\beta}{M}C + |SetB|\frac{\alpha}{M}C + |SetA||SetB|C \\
& \approx (\alpha+\beta)C + \frac{\beta}{M}C + |SetA|\frac{\beta}{M}C + |SetB|\frac{\alpha}{M}C + |SetA||SetB|C
\end{aligned}$$

Although IC needs to process several join operations, it has a lower execution
time cost because only a few items participate in each join operation. The space cost of
IC is $O(\alpha \cdot sizeA + \beta \cdot sizeB)$.
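
The incremental update of IC can be sketched in Python as follows. This is a sketch under stated assumptions rather than the implementation measured later: `Window`, `join_count` and `ic_step` are hypothetical names, items are reduced to a join key and a timestamp, timestamps are assumed to be non-decreasing, and per-key counters stand in for the hash buckets, so each join is counted from key multiplicities instead of being materialized.

```python
from collections import Counter, deque


class Window:
    """Sliding window: arrival-ordered items plus per-key counts of live items."""

    def __init__(self):
        self.items = deque()      # (timestamp, key), oldest first
        self.per_key = Counter()  # join key -> number of live items

    def insert(self, key, ts):
        self.items.append((ts, key))
        self.per_key[key] += 1

    def expire(self, now, T):
        """Remove expired items and return them as a Counter (SetA or SetB)."""
        expired = Counter()
        while self.items and now - self.items[0][0] > T:
            _, key = self.items.popleft()
            self.per_key[key] -= 1
            expired[key] += 1
        return expired


def join_count(left, right):
    """Size of the equi-join of two key-count multisets."""
    return sum(n * right[k] for k, n in left.items())


def ic_step(own, other, key, ts, T, count_value):
    """Update the join count after an item with `key` arrives in window `own`."""
    own.insert(key, ts)                                          # step 1
    set_own = own.expire(ts, T)                                  # step 2
    set_other = other.expire(ts, T)
    val1 = other.per_key[key]                                    # {f} join SWB
    val2 = join_count(set_own, other.per_key)                    # SetA join SWB
    val3 = join_count(set_other, own.per_key) - set_other[key]   # SetB join (SWA \ {f})
    val4 = join_count(set_own, set_other)                        # SetA join SetB
    return count_value + val1 - val2 - val3 - val4               # step 7
```

For an item of stream A, call `ic_step(swa, swb, ...)`; for an item of stream B, the same function can be called with the two windows swapped, since the count is symmetric.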

3.1.3 TC Algorithm 

IC performs well; however, several join operations still
need to be processed, which consumes considerable time. In contrast, TC processes
only one join to calculate the aggregate, by storing
a summary with each item in the sliding window.







Fig. 3. TC algorithm 

The concept of TC comes from the observation that, given two relations
A and B, $|A \bowtie B| = \sum_{a \in A} |\{a\} \bowtie B|$. As shown in Figure 3, each item in SWA and SWB
is assigned an attribute CV. Assuming that SWA joins SWB on attribute $a$, for
each item $f$ in SWA, $f.CV = |\{ g \mid g \in SWB,\ g.timestamp > f.timestamp,\ g.a = f.a \}|$.
That is, the value of $f.CV$ is the number of items in SWB that arrive later than $f$
and match $f$.

It should be emphasized that the value of $f.CV$ is not the total number of items in
SWB that match $f$. If CV were defined that way, then whenever an item
$g$ in SWB matching $f$ expired, the value of $f.CV$ would have to be decreased by one.
That is, we would have to process the join $\{g\} \bowtie SWA$ to find all the items in SWA
that match item $g$ and decrease their CV values by one. The join operation could not be
avoided during the invalidation of the sliding window under that definition.

With our definition of the attribute CV, the invalidation of sliding windows
does not modify any CV value, as the following theorem shows.

Theorem 2. Given two sliding windows SWA and SWB with time size $T$, suppose every $k \in SWA$ has
an attribute CV with $k.CV = |\{ g \mid g \in SWB,\ g.timestamp > k.timestamp,\ g.a = k.a \}|$. If at
time $T_j$ an item $g$ in SWB expires, then for every $k \in SWA_j$ the invalidation of $g$
does not modify the value of $k.CV$.

Proof. Suppose that the invalidation of $g$ modifies the CV value of some $k \in SWA_j$. Then
$g.a = k.a$ and $g.timestamp > k.timestamp$ hold. Since the item $g$ is expired, $T_j -
g.timestamp > T$ is true, and hence $T_j - k.timestamp > T$; that is, $k \notin SWA_j$, which
contradicts $k \in SWA_j$. So the supposition is wrong, and the theorem is proved. □

According to Theorem 2, no join needs to be processed when the sliding windows
are invalidated.

Theorem 3. Given two sliding windows SWA and SWB, where for every $f \in SWA$,
$f.CV = |\{ g \mid g \in SWB,\ g.timestamp > f.timestamp,\ g.a = f.a \}|$, and for every $g \in SWB$,
$g.CV = |\{ f \mid f \in SWA,\ f.timestamp > g.timestamp,\ f.a = g.a \}|$, the following equation holds:

$$|SWA \bowtie SWB| = \sum_{k \in SWA \cup SWB} k.CV$$

Proof. Every $r \in SWA \bowtie SWB$ has the form $r = f \circ g$, where $f \in SWA$ and $g \in SWB$. According to the
definition of attribute CV, if $f.timestamp > g.timestamp$, then $r$ adds one to $g.CV$; otherwise it
adds one to $f.CV$. So every $r \in SWA \bowtie SWB$ contributes exactly one to $\sum_{k \in SWA \cup SWB} k.CV$. That is:

$$|SWA \bowtie SWB| = \sum_{k \in SWA \cup SWB} k.CV \qquad \Box$$

TC is designed according to Theorem 3. Assuming that a new item $f$ of
stream A arrives at time $T_j$, the TC algorithm is as follows.

TC Algorithm

Input: SWA, SWB, $T_j$, $f$, $T$

Output: CountValue

1. $f.CV$ = 0; CountValue = 0;

2. Insert $f$ into SWA;

3. Invalidate every expired tuple $g$ in SWA and SWB that satisfies $T_j -
g.timestamp > T$;

4. Find the bucket $B$ in SWB that matches the hash value of $f$;

5. FOR each tuple $g \in B$ DO

6. IF $g.a = f.a$

7. $g.CV$++;

8. FOR each $k \in SWA$ DO

9. CountValue = CountValue + $k.CV$;

10. FOR each $g \in SWB$ DO

11. CountValue = CountValue + $g.CV$;

12. Return CountValue;

Let us analyze the time cost of TC. We ignore the cost of the first step. The cost
of the second step is $C$, and the cost of the third step is $(\alpha+\beta)C$. In TC, each
sliding window is also a hash table, and the time cost of steps 5~7 is
$(\beta/M)C$. Steps 8~12 scan both sliding windows, at a cost of $(\alpha+\beta)C$. So the total
cost of TC is:

$$C + 2(\alpha+\beta)C + \frac{\beta}{M}C \approx 2(\alpha+\beta)C + \frac{\beta}{M}C$$

In TC, each item in the sliding windows has an attribute CV of
integer type. The space cost of TC is therefore:

$$O(\alpha \cdot sizeA + \beta \cdot sizeB + (\alpha+\beta) \cdot SizeOf(Int)).$$
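
The CV bookkeeping of TC can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the evaluated implementation: `Item`, `CVWindow` and `tc_step` are hypothetical names, timestamps are assumed to be non-decreasing, and the hash table keeps one bucket per distinct join key.

```python
from collections import defaultdict, deque


class Item:
    def __init__(self, key, ts):
        self.key, self.ts, self.cv = key, ts, 0   # cv starts at 0 (step 1)


class CVWindow:
    def __init__(self):
        self.order = deque()                # live items, oldest first
        self.buckets = defaultdict(deque)   # join key -> live items, oldest first

    def insert(self, item):
        self.order.append(item)
        self.buckets[item.key].append(item)

    def expire(self, now, T):
        # No CV of a surviving item has to change here (Theorem 2).
        while self.order and now - self.order[0].ts > T:
            old = self.order.popleft()
            self.buckets[old.key].popleft()  # the oldest item of its bucket

    def cv_sum(self):
        return sum(item.cv for item in self.order)


def tc_step(own, other, key, ts, T):
    """An item with `key` arrives in window `own`; return the join count."""
    own.insert(Item(key, ts))                  # step 2
    own.expire(ts, T)                          # step 3
    other.expire(ts, T)
    for g in other.buckets.get(key, ()):       # steps 4-7: probe one bucket
        g.cv += 1                              # the new item arrived later than g
    return own.cv_sum() + other.cv_sum()       # steps 8-12 (Theorem 3)
```

Only the probe of the matching bucket touches join partners; expiration just drops items together with their CV values.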



3.2 Performance Study 

3.2.1 Experimental Setup 

To examine the performance of the proposed algorithms, we implemented them in C.
The experiments were performed on an Intel Pentium IV 2.4 GHz machine with 256 MB of
memory, running Windows 2000. We randomly generate two continuous data streams A and B.
Stream A consists of four attributes: Location, Time, Temperature and Timestamp;
stream B also has four attributes: Location, Time, Strength and Timestamp. The items
of A and B are both 18 bytes in size. The following query is tested in our
experiments:

SELECT location, COUNT(*)
FROM A[T SECONDS], B[T SECONDS]
WHERE A.location = B.location;

To compare the performance of the algorithms, we also implement the PC algorithm,
which computes the sliding window join aggregate with the operator tree. The PC
algorithm was introduced in Section 1.

We test the performance of the proposed algorithms in two scenarios:

(i) Variable time window and Constant arrival rate (VTCA)

(ii) Constant time window and Variable arrival rate (CTVA).

We consider VTCA in Section 3.2.2 and CTVA in Section 3.2.3.

3.2.2 Variable Time Window and Constant Arrival Rate 

In this experiment, the parameters for SWA and SWB are the same, the arrival rate of 
the stream is 100 items/sec, the selectivity of the join operator is 0.01, and the number 
of buckets in the hash table is 100. Figure 4 shows the performance of the four algo- 
rithms in the case that the window size is varying, and Figure 5 shows the amount of 
memory consumed by each algorithm. 

As shown in Figure 4, when the stream rate is constant, IC outperforms
the other algorithms, and PC has the worst performance. Let us analyze the
cost of PC. Suppose that a new item $f$ of stream A arrives. The cost of insertion
and invalidation in PC is $(\alpha+\beta)C$. The cost of evaluating $\{f\} \bowtie SWB$
is $(\beta/M)C$, and $\beta\sigma$ join results are produced. Inserting the join results into the
queue costs $\beta\sigma C_n$, where the parameter $C_n$ is much larger than $C$.* The final step of
PC scans the queue to calculate the count, at a cost of $(\alpha\beta\sigma)C$. The
total cost of PC is $(\alpha+\beta)C + (\beta/M)C + \beta\sigma C_n + (\alpha\beta\sigma)C$. According to this
cost formula, the join selectivity $\sigma$ has a great impact on the performance of PC.
In our experiment $\sigma = 1/M$, so the cost of PC is $(\alpha+\beta)C + (\beta/M)C
+ (\beta/M)C_n + (\alpha\beta/M)C$, which is much higher than the cost of the other algorithms.



* The cost of inserting the new item into the sliding window should also be $C_n$. Since only one
item needs to be inserted into the sliding window, we approximate its cost as $C$.

Interestingly, IC executes faster than TC. The reason is that, with a constant
stream arrival rate, only one item in each sliding window expires when a new
item arrives; that is, $|SetA| = 1$ and $|SetB| = 1$. The time cost of IC is then $(\alpha+\beta)C +
(\alpha/M+\beta/M)C + (\beta/M)C + C$, which is less than the cost of TC, $2(\alpha+\beta)C + (\beta/M)C$.
As Figure 4 illustrates, the curves for PC and SC rise sharply as the
sliding window size grows, while the curve for TC goes up slowly. The sliding window size
has little impact on the performance of IC.

Figure 5 shows the memory used by the four algorithms. As expected,
PC uses much more memory than the other algorithms, since
it stores the join results in memory. TC uses slightly more memory
than IC and SC because every item in its sliding windows carries an extra
attribute CV. SC and IC have the lowest space cost.

Fig. 4. Varying sliding window size ($\lambda_a = \lambda_b = 100$, $M = 100$, $\sigma = 0.01$)

Fig. 5. Varying sliding window size ($\lambda_a = \lambda_b = 100$, $M = 100$, $\sigma = 0.01$)



3.2.3 Constant Time Window and Variable Arrival Rate 

In this scenario the data stream arrival rate varies and the sliding window time
size is constant. The parameters for SWA and SWB are again the same. We randomly
generate two data streams whose arrival rates obey the Zipf distribution. At the $i$-th
second the arrival rates of the data streams are $\lambda_i = \frac{1}{i} \times 5000$ items/sec, where
$1 \le i \le 10$. The sliding window time size is 5 seconds, the selectivity of the join
operator is 0.05, and the number of buckets in the hash table is 20.

Figure 6 illustrates the performance of the four algorithms. We begin the experiment
after the sliding window is full, that is, at the 6th second. IC and TC still outperform
the other two algorithms. In the 6th second, TC executes faster than IC, and for
the rest of the time IC performs best. We compare the cost of IC with the cost of TC:

IC: $(\alpha+\beta)C + (\beta/M)C + |SetA|(\beta/M)C + |SetB|(\alpha/M)C + |SetA||SetB|C$
TC: $2(\alpha+\beta)C + (\beta/M)C$

Fig. 6. Varying stream rate ($T_a = T_b = 5$ s, $M = 20$, $\sigma = 0.05$)

It is easy to conclude that if $|SetA| \cdot \beta/M + |SetB| \cdot \alpha/M + |SetA| \cdot |SetB| > \alpha+\beta$, then TC
outperforms IC. In this experiment, at the 6th second, $\alpha = \beta = \sum_i \lambda_i = 7318$. Suppose
that an item arrives at this time; then $|SetA| = |SetB| = 70$, so $|SetA| \cdot \beta/M +
|SetB| \cdot \alpha/M + |SetA| \cdot |SetB| = 2 \cdot 7318 \cdot 70/20 + 70 \cdot 70 = 56126$, while $\alpha+\beta = 14636$. This
means that TC executes more efficiently than IC. We computed the costs of IC and TC at
each second and found that the results correspond with the curves in Figure 6.

The above two experiments show that when the data stream arrival rate is
constant, IC performs best, and when the data stream arrival rate varies
rapidly with time, TC is the best choice for processing the join aggregate. In the following
section we present a scheduling strategy that assigns the most suitable
algorithm to the operator according to the states of the data streams.



4 Maximizing the Processing Efficiency

4.1 Scheduling Strategy 

Continuous queries are long-running queries, and some parameters, such as join
selectivity and stream arrival rates, change while they run.
It is crucial for the query optimizer to choose the most effective algorithm according
to the states of the streams.

In Section 3, we gave a time-cost formula for each proposed algorithm.
Three factors affect the cost of SC, IC and TC: the numbers of
items in the sliding windows, $\alpha$ and $\beta$; the number of buckets in the hash table, $M$; and
$|SetA|$ and $|SetB|$. We can obtain these parameters before executing the algorithms. When an item
arrives from the streams, the query optimizer computes the time cost of each algorithm
and executes the algorithm with the lowest time cost.

IC processes the sliding window join aggregate based on the last aggregate
value, while TC requires every item in the sliding window to have the attribute CV and
its corresponding value; therefore IC and TC cannot switch to each other directly. We
make some modifications to the two algorithms when applying the scheduling strategy.
In IC, each item also has the CV attribute, and its value is computed
while processing $\{f\} \bowtie SWB$. In TC, the last aggregate value is stored.
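
The selection rule can be sketched in Python as follows, using the approximate cost formulas of Section 3. The function names `estimated_costs` and `choose_algorithm` are hypothetical, and the costs are expressed in units of $C$; this is only an illustration of the rule, not the optimizer used in the experiments.

```python
def estimated_costs(alpha, beta, M, set_a, set_b, C=1.0):
    """Approximate per-arrival time costs of SC, IC and TC (Section 3)."""
    return {
        "SC": (alpha + beta + alpha * beta / M) * C,
        "IC": ((alpha + beta) + beta / M + set_a * beta / M
               + set_b * alpha / M + set_a * set_b) * C,
        "TC": (2 * (alpha + beta) + beta / M) * C,
    }


def choose_algorithm(alpha, beta, M, set_a, set_b):
    """Return the name of the algorithm with the lowest estimated cost."""
    costs = estimated_costs(alpha, beta, M, set_a, set_b)
    return min(costs, key=costs.get)


# Values from the 6th second of the experiment in Section 3.2.3.
print(choose_algorithm(7318, 7318, 20, 70, 70))
```

With the Section 3.2.3 values ($\alpha = \beta = 7318$, $M = 20$, $|SetA| = |SetB| = 70$) the rule selects TC; with a single expired item per window it selects IC.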



4.2 Experimental Results 

We developed the algorithm HC, which applies the scheduling strategy, and compare the
performance of HC with IC and TC. In this experiment the parameters for SWA and
SWB are again the same. Two data streams are generated randomly, with the arrival
rate at the $i$-th second being:

$$\lambda_i = \begin{cases} i \times 500 & i \bmod 2 = 1 \\ \frac{1}{i} \times 500 & i \bmod 2 = 0 \end{cases} \qquad 1 \le i \le 10$$

In this experiment the time size of the sliding windows is 5 seconds, the
selectivity of the join operator is 0.05, and the number of buckets in the hash table is
20. As shown in Figure 7, HC achieves the best performance of the three algorithms
at all times, as expected.




Fig. 7. Performance of scheduling strategy ($T_a = T_b = 5$ s, $M = 20$, $\sigma = 0.05$)



5 Processing Complex Sliding Window Join Aggregate 

In this section, we discuss how to extend the proposed algorithms to process
complex sliding window join aggregates, such as queries that contain a Group By clause
or multi-way sliding window join aggregates.

5.1 Processing Group By Clause

It is easy to extend the proposed algorithms to process queries that contain a Group
By clause. We illustrate this with the IC algorithm. The extended IC algorithm
is called GIC.

GIC uses a list structure, Glist, to store the last aggregate value for each group
value. When an item arrives from a stream, GIC updates the sliding windows, obtains
the expired item sets SetA and SetB, and then processes the joins $\{f\} \bowtie SWB$, $SetA \bowtie SWB$,
$SetB \bowtie (SWA \setminus \{f\})$ and $SetA \bowtie SetB$. GIC updates each group's aggregate value stored in
Glist according to the join results. Finally, Glist is output. If the query also has a
Having clause, a predicate is used to filter the Glist values.
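
A possible shape of GIC is sketched below in Python. The sketch rests on assumptions that the text does not fix: the grouping attribute is taken to be the join attribute (as in the test query, which groups by location), Glist is modelled as a `Counter`, timestamps are non-decreasing, and the names `Window`, `gic_step` and `having` are hypothetical.

```python
from collections import Counter, deque


class Window:
    def __init__(self):
        self.items = deque()      # (timestamp, key), oldest first
        self.per_key = Counter()  # key -> number of live items

    def insert(self, key, ts):
        self.items.append((ts, key))
        self.per_key[key] += 1

    def expire(self, now, T):
        expired = Counter()       # expired items per key (SetA or SetB)
        while self.items and now - self.items[0][0] > T:
            _, key = self.items.popleft()
            self.per_key[key] -= 1
            expired[key] += 1
        return expired


def gic_step(own, other, key, ts, T, glist):
    """Update the per-group counts after an item with `key` arrives in `own`."""
    own.insert(key, ts)
    set_own, set_other = own.expire(ts, T), other.expire(ts, T)
    glist[key] += other.per_key[key]                  # {f} join SWB
    for k, n in set_own.items():
        glist[k] -= n * other.per_key[k]              # SetA join SWB
        glist[k] -= n * set_other[k]                  # SetA join SetB
    for k, n in set_other.items():
        own_without_f = own.per_key[k] - (1 if k == key else 0)
        glist[k] -= n * own_without_f                 # SetB join (SWA \ {f})
    return glist


def having(glist, predicate):
    """Filter the group values, as a Having clause would."""
    return {k: v for k, v in glist.items() if predicate(v)}
```

For example, `having(glist, lambda v: v >= 2)` keeps only the groups with at least two join results.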



5.2 Processing Multi-way Sliding Window Join Aggregate 

Multi-way sliding window join processing has been discussed by Lukasz et al. in
[15]. They combined binary join operators and reorganized the query plan dynamically.
Viglas et al. proposed the alternative of extending existing symmetric binary
operators to handle more than two inputs, and implemented a multi-way
join operator, Mjoin [16]. It is straightforward to apply the Mjoin operator to
multi-way sliding window joins. As shown in Figure 8, when a new item arrives, it is
inserted into the corresponding sliding window, and then the expired items of all the
sliding windows are invalidated. The new item probes the other sliding windows and outputs
the join results.

$S_1 \bowtie S_2 \bowtie S_3 \bowtie \cdots \bowtie S_{n-1} \bowtie S_n$




Fig. 8. Mjoin operator 

We process the multi-way sliding window join aggregate based on the Mjoin operator.
As an example, TC is extended to process the multi-way sliding window
join aggregate. The new algorithm is named MTC. When a new item $f$
arrives from stream $S_j$, a set of join results is produced by the execution of
Mjoin. Each join result $r$ can be represented as $f \circ g_1 \circ \cdots \circ g_n$, where $g_i \in SW_i$. Because the
invalidation of any $g_i$ causes the join result $r$ to expire, the time to live of $r$ is decided by
the item that arrived in the system earliest among the items $g_1, g_2, \ldots, g_n$. Since the
contribution of $r$ to the COUNT value is one, the CV of that earliest item is increased by one.
According to Theorem 3:

$$|SW_1 \bowtie SW_2 \bowtie \cdots \bowtie SW_n| = \sum_{i=1}^{n} \sum_{g \in SW_i} g.CV$$

MTC calculates the multi-way sliding window join aggregate according
to this formula.
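
The MTC bookkeeping can be sketched in Python as follows. The sketch assumes an equi-join of all streams on one common key, strictly increasing timestamps and at least two windows; each window is a plain `deque` that is scanned for matches instead of a hash table, and the names `Item` and `mtc_step` are hypothetical.

```python
from collections import deque
from itertools import product


class Item:
    def __init__(self, key, ts):
        self.key, self.ts, self.cv = key, ts, 0


def mtc_step(windows, j, key, ts, T):
    """An item with join `key` arrives on stream j; return the new COUNT.

    `windows` is a list of at least two deques of Item, oldest item first.
    """
    for w in windows:                              # invalidate expired items
        while w and ts - w[0].ts > T:
            w.popleft()
    windows[j].append(Item(key, ts))
    # Matching items of the other windows (the probe done by Mjoin).
    matches = [[g for g in w if g.key == key]
               for i, w in enumerate(windows) if i != j]
    for combo in product(*matches):                # one tuple per new join result r
        # The earliest item decides the lifetime of r, so it carries the count.
        min(combo, key=lambda g: g.ts).cv += 1
    return sum(g.cv for w in windows for g in w)
```

Because the earliest item of a join result expires before any other item of that result, dropping it together with its CV removes exactly the results that are no longer valid.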
5.3 Experiment Results

We implement four algorithms, GSC, GPC, GIC and GTC, to handle sliding window
join aggregate queries that contain a Group By clause; they extend SC, PC,
IC and TC respectively. The performance of the four algorithms is tested
in this experiment. As shown in Figure 9, at a constant streaming rate GIC has the
best performance.

The performance of processing the multi-way sliding window join aggregate is also
tested. We implement three algorithms, MSC, MPC and MTC, which are based on SC, PC
and TC respectively. Among the three algorithms, MTC executes the most efficiently.
As Figure 10 illustrates, as the number of sliding windows grows, the performance
of MSC and MPC decreases sharply, while MTC remains efficient.




Fig. 9. Processing Group By clause ($\lambda_a = \lambda_b = 100$, $M = 100$, $\sigma = 0.01$)

Fig. 10. Processing Multi-SW join aggregate ($\lambda_a = \lambda_b = 100$, $T_a = T_b = 10$ s, $M = 100$, $\sigma = 0.01$)



6 Conclusions and Future Work 

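A sliding window is a commonly used method for restricting the range of continuous queries
over unbounded data streams. Many effective methods for processing sliding window
queries have been investigated; however, to our knowledge no algorithm for
processing sliding window join aggregates has been proposed to date in the literature.
We present three novel sliding window join aggregate algorithms: SC, IC and TC.
Theoretical analysis and experimental results show that the proposed algorithms are
efficient. Since the performance of the algorithms varies with the stream arrival rate, a
scheduling strategy is also presented to maximize the processing efficiency.

In this paper we only consider the aggregate function COUNT. It is easy to extend the
proposed algorithms to handle the aggregate functions SUM and AVG, but these methods
are not suitable for the aggregate functions MAX and MIN. In future work, we
plan to investigate algorithms for processing MAX and MIN over sliding window
join results.

Appendix A

Theorem 1. Given four relations A, B, C and D, the following equation holds:

$$|C \bowtie D| = |A \bowtie B| + |(C \setminus A) \bowtie D| + |(D \setminus B) \bowtie C| - |(C \setminus A) \bowtie (D \setminus B)| - |(A \setminus C) \bowtie (B \cap D)| - |(B \setminus D) \bowtie (A \cap C)| - |(A \setminus C) \bowtie (B \setminus D)|$$

Proof. By the rule $(A \setminus B) \bowtie C = (A \bowtie C) \setminus (B \bowtie C)$,

$$\begin{aligned}
|(C \setminus A) \bowtie (D \setminus B)| &= |((C \setminus A) \bowtie D) \setminus ((C \setminus A) \bowtie B)| \\
&= |(C \setminus A) \bowtie D| - |((C \setminus A) \bowtie D) \cap ((C \setminus A) \bowtie B)| \\
&= |(C \setminus A) \bowtie D| - |(C \setminus A) \bowtie (D \cap B)|,
\end{aligned}$$

and similarly $|(A \setminus C) \bowtie (B \setminus D)| = |(A \setminus C) \bowtie B| - |(A \setminus C) \bowtie (B \cap D)|$.

So the right side of the equation in Theorem 1 becomes

$$\begin{aligned}
& |A \bowtie B| + |(C \setminus A) \bowtie D| + |(D \setminus B) \bowtie C| - |(C \setminus A) \bowtie D| + |(C \setminus A) \bowtie (D \cap B)| \\
& \quad - |(A \setminus C) \bowtie (B \cap D)| - |(B \setminus D) \bowtie (A \cap C)| - |(A \setminus C) \bowtie B| + |(A \setminus C) \bowtie (B \cap D)| \\
&= |A \bowtie B| + |(D \setminus B) \bowtie C| + |(C \setminus A) \bowtie (D \cap B)| - |(B \setminus D) \bowtie (A \cap C)| - |(A \setminus C) \bowtie B|.
\end{aligned}$$

Furthermore,

$$\begin{aligned}
|(C \setminus A) \bowtie (D \cap B)| &= |(C \bowtie (D \cap B)) \setminus (A \bowtie (D \cap B))| \\
&= |C \bowtie (D \cap B)| - |(C \bowtie (D \cap B)) \cap (A \bowtie (D \cap B))| \\
&= |C \bowtie (D \cap B)| - |(C \cap A) \bowtie (D \cap B)|,
\end{aligned}$$

and similarly $|(B \setminus D) \bowtie (A \cap C)| = |B \bowtie (A \cap C)| - |(B \cap D) \bowtie (A \cap C)|$. Then

$$\begin{aligned}
& |A \bowtie B| + |(D \setminus B) \bowtie C| + |(C \setminus A) \bowtie (D \cap B)| - |(B \setminus D) \bowtie (A \cap C)| - |(A \setminus C) \bowtie B| \\
&= |A \bowtie B| + |(D \setminus B) \bowtie C| + |C \bowtie (D \cap B)| - |(C \cap A) \bowtie (D \cap B)| - |B \bowtie (A \cap C)| + |(B \cap D) \bowtie (A \cap C)| - |(A \setminus C) \bowtie B| \\
&= |A \bowtie B| + |(D \setminus B) \bowtie C| + |C \bowtie (D \cap B)| - |B \bowtie (A \cap C)| - |(A \setminus C) \bowtie B| \\
&= |A \bowtie B| + |(D \bowtie C) \setminus (B \bowtie C)| + |C \bowtie (D \cap B)| - |B \bowtie (A \cap C)| - |(A \bowtie B) \setminus (C \bowtie B)| \\
&= |A \bowtie B| + |D \bowtie C| - |(D \bowtie C) \cap (B \bowtie C)| + |C \bowtie (D \cap B)| - |B \bowtie (A \cap C)| - |A \bowtie B| + |(A \bowtie B) \cap (C \bowtie B)| \\
&= |A \bowtie B| + |D \bowtie C| - |(D \cap B) \bowtie C| + |C \bowtie (D \cap B)| - |B \bowtie (A \cap C)| - |A \bowtie B| + |(A \cap C) \bowtie B| \\
&= |D \bowtie C| = |C \bowtie D|. \qquad \Box
\end{aligned}$$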
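
As a sanity check that complements the proof, the identity of Theorem 1 can be tested numerically on small random relations. The Python sketch below is illustrative only: relations are modelled as sets of (key, id) tuples joined on the key, and the names `join`, `theorem1_rhs` and `random_relation` are hypothetical.

```python
import random


def join(X, Y):
    """Equi-join of two sets of (key, id) tuples on the key."""
    return {(x, y) for x in X for y in Y if x[0] == y[0]}


def theorem1_rhs(A, B, C, D):
    """Right-hand side of the equation in Theorem 1."""
    return (len(join(A, B)) + len(join(C - A, D)) + len(join(D - B, C))
            - len(join(C - A, D - B)) - len(join(A - C, B & D))
            - len(join(B - D, A & C)) - len(join(A - C, B - D)))


def random_relation(rng, size=6):
    # Small universe, so the relations overlap and the set differences matter.
    return {(rng.randrange(3), rng.randrange(8)) for _ in range(size)}


rng = random.Random(0)
for _ in range(200):
    A, B, C, D = (random_relation(rng) for _ in range(4))
    assert len(join(C, D)) == theorem1_rhs(A, B, C, D)
```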
Abstract. There currently exist many geographical databases that represent
the same part of the world, each with its own levels of detail and
points of view. The use and management of these databases sometimes 
requires their integration into a single database. One important issue 
in this integration process is the ability to analyse and understand the 
differences among the multiple representations. These differences can of 
course be explained by the various specifications but can also be due to 
updates or errors during data capture. In this paper, after describing the 
overall process of integrating spatial databases, we propose a process to 
interpret the differences between two representations of the same geo- 
graphic phenomenon. Each step of the process is based on the use of an 
expert system. Rules guiding the process are either introduced by hand 
from the analysis of specifications, or automatically learnt from exam- 
ples. The process is illustrated through the analysis of the representations 
of traffic circles in two actual databases. 



1 Introduction 

In recent years, a new challenge has emerged from the growing availability of
heterogeneous databases: their interoperability. The combination of multiple sources
can lead to several solutions: “multidatabase” systems [1], “federated” systems [2]
or “distributed database” systems [3].

Database integration has already received much attention in the literature.
Some contributions concern the problem of schema integration, for which a survey
of the different approaches can be found in [4,5]. Schema matching
reveals various differences and conflicts between elements (semantic conflicts,
structural conflicts, ...) and several propositions exist to solve them [6,7]. Other
works have focused on the problem of data integration. Procedural or declarative
approaches can be adopted to achieve it [4,8]. Other contributions propose to
support the integration with artificial intelligence techniques [9], in particular with
knowledge representation and reasoning techniques [10].

In the field of geographical databases (GDB), integration is becoming an
issue of growing interest. The combination of multiple sources aims to increase
the potential for application development. It can also help producers
maintain their databases more consistently: updates can be propagated
automatically and quality controls can be facilitated [11]. Depending on the strategy,
the integration can lead to the creation of a composite product or a multi-representation
system. Specific tools for merging geometrical data have been
proposed [12], as well as data models and structures to support multiple representations
[13,14,15].

Today, an important issue remains to be solved: the detection and management
of inconsistencies between the geographical representations. Objects of the
databases may be collected according to different specifications and can thus
be modelled in various ways. These differences are generally normal and reflect
the multiple levels of detail of the databases and their application domains.
However, incompatibilities can exist between representations, deriving from updates
or errors during data collection. These inconsistencies are more problematic
because the user of the system can be confronted with contradictory
results when using one representation or another. Incompatibilities can relate
to the shape and position of the objects as well as to attributes and spatial relations.
Few works tackle this problem; in general, the propositions deal with
the consistency of topological relations and presuppose an order between
representations [16,17].

In this paper, a new approach is proposed to assess the consistency between 
the geometrical representations of geographical data. The approach is based on 
the following points: 

— The use of background knowledge and in particular, the specifications of 
each database, to justify the difference between multiple representations. 

— The use of a rule-based system to manipulate this knowledge and check the 
consistency in an automatic way. 

— The use of machine learning techniques to acquire the knowledge when the 
specifications are not available, ambiguous or incomplete. 

The paper is structured as follows. First, we present the integration process
for spatial databases and discuss its particularities (section 2). Then, we
detail the elements of our approach and present the architecture of the
implemented system (section 3). We illustrate the feasibility of the process with a
particular application (section 4). Finally, we conclude the paper and give some research
perspectives (section 5).

2 Spatial Database Integration Process 

2.1 Specificities

The unification of vector spatial databases requires adaptations of classical
integration methodologies [18,19]. These adaptations mainly result from the existence
of a geometry associated with the data. The geometry makes the identification of
correspondences more complex and introduces specific conflicts between the
data.

For instance, particular matching tools are necessary to define the correspondences
between the data of the different sources because, in general, it is
not possible to declare a common identifier as for classical databases. Only
the position of the objects can be exploited. Generally, the matching algorithms
are thus based on the computation of distances between geometric locations.
Some of them also use additional criteria such as the shape of the objects and the
topological relations they have [20,21,22].

Specific conflicts also appear between the data [23]. They concern both spatial
representations and attributes. The conditions under which objects are represented
can differ from one database to another, which can lead to specification
conflicts (figure 1). For instance, an object can be represented in one database
but be absent from the other (selection conflict), or this object can correspond to
a group of primitives in the second database (decomposition conflict leading to
a fragmentation conflict at the data level). These conflicts must be identified
during the declaration of correspondences, and integration techniques must be
extended to solve them [18].




Fig. 1. Different representations of the same geographical phenomenon leading to several
conflicts between data.



Another important particularity of spatial databases, which makes the
integration process more complex, is the existence of a lot of implicit information.
There is a gap between what we perceive when we visualise the geometrical
instances and what is actually stored in the databases. For example, it is easy
to identify on the screen the main road that leads to a particular town, or the
most sinuous river among a set of objects of this category, but this information
(“main”, “leads”, “sinuous”) is not explicitly stored in the databases. However,
the integration frequently requires the extraction of this kind of information in
order to homogenise the different sources and to check the consistency of the
representations (see section 4). Specific spatial analysis tools are generally
required to achieve this extraction.

Fig. 2. Main steps of the spatial databases integration process



2.2 Main Steps of the Integration Process 



The main steps of the spatial integration process are illustrated in figure 2.
They can be connected with the three phases defined for classical databases:
pre-integration, correspondence investigation, and integration (see [24] for a
detailed description). The first task consists in studying each database, in
order to gain a good understanding of its content and to prepare the
integration. A detailed analysis of the specifications is undertaken, and
several modifications at the schema and data levels are carried out. The schemas
are rearranged in different ways to make them more homogeneous semantically and
syntactically. The spatial data are enriched by extracting the implicit
information.

For this first step, an important issue concerns the specifications analysis
phase. These documents are usually expressed in natural language and are thus
poorly formalised. For geographical databases, they are particularly large and
detailed. Manipulating them is a hard task, and manipulating them automatically
is almost impossible. We have recently proposed a model to better formalise
these specifications and use them in a more practical way [25], but this
research is still in progress [26].

The second task aims at identifying and declaring correspondences between
the elements of the schemas and the geometrical instances of the databases.
Similar elements and conflicts at the schema level can be defined with a
particular language. For instance, [8] proposes clauses to specify,
respectively, which schema elements are related (the Interdatabase
Correspondence Assertions - ICA), how the schema elements are related (the
relationship that holds between them), how corresponding instances are
identified (With Corresponding Identifiers - WCI), and how representations are
related (With Corresponding Attributes - WCA). At the data level, the elements
are put in correspondence with matching algorithms. A particular clause refers
to it: the Spatial Data Matching (SDM) [18].

For this step, a current issue concerns the assessment of consistency between
the spatial representations. It is necessary to detect errors before
integrating the data. This is our research problem, and we present our approach
in the next section.

The last tasks concern the actual integration. The new schema is defined
according to the specifications of the unified database. This presupposes rules
to translate and restructure the initial schemas. Data instances are also
integrated. Depending on the case, representations are merged, or kept and
transferred to the new system.



3 Consistency Assessment Between Spatial Data 

3.1 Our Overall Approach 

Our approach to assessing the consistency between representations is
illustrated in figure 3. The process is decomposed into several steps. We
describe it briefly below; each phase is detailed through the application
developed in section 4.




Fig. 3. The reasoning path to assess the consistency between multiple representations 



The process starts with one correspondence between classes of the schemas
of the two databases. We presume that matching at the schema level has already
been carried out. It remains to compute and study the correspondences between
instances at the data level. One correspondence can be expressed in terms of
ICA [8]. For example, we know that the building class in DB1 tallies with the
residential building and commercial building classes in DB2, which can be
expressed in the form:

DB1.Building = DB2.{ResidentialBuilding, CommercialBuilding}.

The task of specifications analysis is the next step. The specifications are
analysed in order to determine several rule bases that will be used to guide
each of the ensuing steps. These rules primarily describe what exactly the
databases contain, what differences are likely to appear, and under which
conditions. In that sense, these rules constitute the whole knowledge necessary
for the assessment process. Basically, this step is performed through the
analysis of documents. However, these documents are sometimes not available, or
their descriptions are incomplete and ambiguous. In these cases, knowledge
elicitation techniques can be used in order to learn these specifications or to
complete them [27].

The next step concerns the enrichment of each dataset. As we have already 
mentioned, the aim is to extract implicit information in order to express the 
datasets in a more homogeneous way. The enrichment leads to the creation of 
new objects and relations. 

A preliminary control step is then planned: the intra-database control.
During this step, part of the specifications is checked so as to detect some
internal errors and to determine how well the data instances comply with the
specifications overall.

Once the data of both databases have been independently controlled, we
match them. The relationships are computed through geometric and topological
data matching. Each pair is then validated and characterised by a degree of
confidence.

The inter-database control follows this step. It consists in comparing
the representations of the homologous objects. This comparison leads to the
evaluation of the conformity of the differences. In the end, the differences
existing between the objects of each matching pair are expressed in terms of
equivalence (differences justified by the specifications), inconsistency
(matching or capture error), or update.

After the interpretation of all the differences, a global evaluation is supplied: 
the number of equivalencies, the number of errors and their seriousness, and the 
number of differences due to updates. 

3.2 Which Knowledge Is Required to Check the Consistency? 

Our approach is founded on the use of the specifications of each GDB to guide
the assessment of the consistency. These metadata make it possible to judge
whether the differences in representation are normal or not, since they define
the content of the databases and the modelling of the objects. In general, the
specifications are described in natural language, in paper documents. Thus, this
information seems easily exploitable for the task of checking the
representations. Nevertheless, it is possible that the specifications only exist
through the data. They are not always explicitly described in documents. And
when they exist, the description of the representation is often not exhaustive,
because it is difficult to imagine all the possible cases liable to appear in
the field. In addition, the capture constraints are often insufficiently
formalised and can lead to ambiguous interpretation (for instance: when a
crossroads is “vast”, a polygon is created). Other background knowledge is also
required for the process. For example, information relating to the quality of
the datasets is necessary in order to fix the parameters of the matching
algorithms, and common geographical knowledge is useful when the specifications
are not sufficient to explain the differences, principally in the case of
updates. In fact, the main challenge in assessing the consistency is the
acquisition of this knowledge. Experts in the field could define the knowledge
they use when the specifications are not available, but in general these experts
are rarely able to supply an explicit description of it. This key problem is
well known as the “knowledge acquisition bottleneck”.



3.3 How to Acquire the Knowledge? 

To face this issue, we have decided to use supervised machine learning
techniques [28,29]. This induction process is one of the solutions developed in
the Artificial Intelligence field. Its aim is to automatically derive rules from
a set of labelled examples given by an expert and to apply these learned rules
to classify other examples whose classification is unknown. In our context,
these techniques can be used at several steps. First, during the specifications
analysis phase, when capture criteria are too complex or imprecise to write down
the knowledge by hand in the form of rules. In this case, inductive tools can be
used to grasp the necessary knowledge and organise it. Second, during the
enrichment phase, learning algorithms can be used to extract implicit
information. Then, during the matching step, parameters of geometrical matching
procedures can be learned (for instance, the distance threshold to select
candidates). Finally, these techniques can help to describe each matching pair
in terms of equivalence or inconsistency during the inter-database control
[30].
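
As an illustration of how a matching parameter could be learned, the following
Python sketch derives a distance threshold from labelled examples. It is a
simple exhaustive search over the observed distances (a one-level decision
rule), chosen here for brevity; it is not claimed to be the learning algorithm
used in the system described in this paper. The function name and the sample
data are hypothetical.

```python
def learn_distance_threshold(examples):
    """Pick the threshold t that minimises errors of the rule
    'distance <= t  =>  the pair is a correct match'.

    examples: list of (distance, is_match) pairs labelled by an expert.
    Returns (threshold, number_of_errors). Only observed distances are
    tried as candidate thresholds, which is an assumption of this sketch.
    """
    best_t, best_err = None, len(examples) + 1
    for t in sorted({d for d, _ in examples}):
        # Count the examples for which the rule disagrees with the label.
        err = sum((d <= t) != label for d, label in examples)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Hypothetical labelled examples: (distance in metres, expert says "match").
labelled = [(1.0, True), (2.5, True), (4.0, True), (6.0, False), (9.0, False)]
threshold, errors = learn_distance_threshold(labelled)
```

On these separable examples the search returns the largest distance labelled
as a match, with no remaining error. With noisy labels the returned threshold
is only the one with the fewest disagreements.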

3.4 How to Manipulate the Knowledge in an Automatic Way? 

We have opted for the use of an expert system. Such systems have already
proved to be efficient in numerous fields where complex knowledge needs to be
introduced. They separate the knowledge, embedded in rules, from the tools that
handle it [31]. This gives them great flexibility, because it is possible to
introduce a large number of rules, and because the rules can be analysed, added,
modified, or removed in a very simple way. In addition, the inference engine
itself determines the activation of rules, whereas handling this in a procedural
way could turn out to be quite hard if not impossible. Furthermore, we have
split the reasoning path of the task into several steps (figure 3). In that way,
we have adopted the approach of second-generation expert systems [31],
considering the control over the inferences as a kind of knowledge in itself and
introducing it explicitly in the system.



3.5 Architecture of the System 

Two main modules compose the general architecture of our system (figure 4):
the experimental Oxygene Geographical Information System and the Jess expert
system. Oxygene is a platform developed at the COGIT laboratory [32]. Spatial
data is stored in the relational Oracle DBMS, and the manipulation of data is
performed with Java code in the object-oriented paradigm. The mapping between
the relational tables and the Java classes is done by the OJB library. A Java
API makes the link between this platform and the second module, the Jess
rule-based system. The latter is an open source environment which can be tightly
coupled to code written in the Java language [33]. The rules used by Jess
originate directly from the specifications, or have been gathered with the
learning tools.



Fig. 4. Overall architecture of the system

4 Experimentation 

4.1 The Case Study 

We have decided to implement the process described above for the case of
traffic circles in two databases from the French National Mapping Agency:
BDCarto and Georoute. These databases have been defined according to different
specifications, in order to serve different application domains and analysis
levels. The first one has a decametric resolution and aims at satisfying
application needs at regional and departmental levels. The other one, Georoute,
is a database with a metric resolution dedicated to traffic applications
(figure 5). Among these data, we will identify and select the traffic circles to
study the consistency between representations.



4.2 Pre-integration 

Two steps of our process can be connected to the pre-integration phase: the 
specifications analysis and the enrichment of spatial data. 

Specification Analysis. For both databases, the specifications mention that
the traffic circles may have one of two representations: a point (simple
representation) or a group of connected points and lines (complex
representation). The selection of the representation is dictated by the diameter
of the object in the real world and the presence of a central reservation. These
selection criteria are different from one database to another (figure 6).



Fig. 5. Extract of the road theme of the two spatial databases studied



[Figure 6 content: a diameter axis in metres with marks at 0, 30, 50, and 100.
BDCarto labels: Node Type = simple crossroad; Node Type = small traffic circle;
Node Type = large traffic circle, Edge Vocation = ramp. Georoute labels: Node
Type = traffic circle; Node Type = simple crossroad and Direction of cycle:
giratory.]

Fig. 6. Some specifications concerning the representation of the traffic circles of
BDCarto and Georoute

The traffic circle classes are thus not explicit in the two databases. In the
case of the point representation, a particular value of the ‘node kind’
attribute can be used to select them. In the case of the complex representation,
it is necessary to resort to specific geometric tools to extract the detailed
representation. This is done in the next step of the process.
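
The diameter criteria of figure 6 can be sketched as two small Python
functions. The thresholds (30 m for Georoute, 50 m and 100 m for BDCarto) are
those appearing in the figure and in the correspondence assertions of section
4.3. The treatment of the boundary values, the omission of the central
reservation criterion, and the function names are assumptions made for this
illustration only.

```python
def bdcarto_representation(diameter_m):
    """Expected BDCarto representation of a traffic circle (sketch).

    Returns (kind, node_type); boundary handling is an assumption.
    """
    if diameter_m < 50:
        return ("point", "simple crossroad")
    if diameter_m < 100:
        return ("point", "small traffic circle")
    # Complex representation: nodes plus road sections with vocation 'ramp'.
    return ("complex", "large traffic circle")

def georoute_representation(diameter_m):
    """Expected Georoute representation of a traffic circle (sketch)."""
    if diameter_m < 30:
        return ("point", "traffic circle")
    # Complex representation: simple crossroad nodes on a giratory cycle.
    return ("complex", "simple crossroad")

# A 40 m traffic circle is a point in BDCarto but a complex object in Georoute.
example = (bdcarto_representation(40), georoute_representation(40))
```

Such a difference between the two expected representations is justified by
the specifications, so it would be interpreted as an equivalence rather than
as an inconsistency.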



Enrichment. The enrichment of the databases concerns both data and schemas.
Introducing the traffic circles in the data requires the definition of new
classes and relations at the schema level, as well as their instantiation
(figure 7). The creation of these objects is a first step towards the
identification of federative concepts [26], corresponding to geographical
entities defined independently of the representation. The correspondence between
these concepts and the objects of each database would help in creating the
unified schema. It comes close to the notion of ontology [34].



Fig. 7. The enrichment step induces modifications at the schema and data levels



The extraction of the traffic circles required several steps for each
database. First, we dealt with the simple traffic circles. The class was created
and instantiated, as well as the relation with the road nodes. This task did not
present any difficulty, since the simple traffic circles are road nodes for
which the attribute nature takes the value ‘traffic circle’ (this is the case
for Georoute). Then, we extracted the complex traffic circles. We first computed
a topological graph (a planar graph with faces). Each face was characterised by
a circularity index, the number of associated nodes, and the direction of the
cycle. We defined specific rules concerning those criteria and introduced them
in our decision support system. Each face was analysed and, finally, only faces
corresponding to a traffic circle were retained. Each complex traffic circle was
thus created with a polygonal geometry.
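
The circularity index is not defined further in this paper. One common choice,
assumed in the Python sketch below, is the isoperimetric quotient 4*pi*A/P^2,
which equals 1 for a disc and decreases for elongated shapes. The direction of
the cycle is taken here from the sign of the signed area of the face. The
function names and the sample faces are illustrative assumptions.

```python
import math

def signed_area(vertices):
    """Shoelace formula; positive when the vertices run counter-clockwise."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2.0

def perimeter(vertices):
    """Length of the closed boundary of the face."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))

def circularity_index(vertices):
    """Isoperimetric quotient 4*pi*A/P^2 (assumed definition of the index)."""
    p = perimeter(vertices)
    if p == 0:
        raise ValueError("degenerate face")
    return 4.0 * math.pi * abs(signed_area(vertices)) / (p * p)

# Hypothetical faces: a unit square and a 32-sided approximation of a circle.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
ring = [(math.cos(2 * math.pi * k / 32), math.sin(2 * math.pi * k / 32))
        for k in range(32)]
```

A rule could then retain a face only if its index exceeds a chosen value and
its cycle has the expected direction; the value itself would have to come from
the specifications or from learning.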

4.3 Data Correspondence Computation: Consistency Assessment

The enrichment phase has been performed to extract the implicit information
required for the consistency study, but also to bring the structure of the data
and schemas closer together, and thus to facilitate the future integration. In
that sense, we have defined the Interdatabase Correspondence Assertion
concerning the traffic circles. Note that the assertions before the enrichment
would have been more complex than the one declared after the creation of the new
classes. For instance, the initial ICA could be expressed in the following
form:

SELECTION(BDC.Node.NodeType=‘simple crossroad’) BDC.Node
= trafficCircle(GEO.SET) ∧ (30 m < diameterTrafficCircle(GEO.SET) < 50 m)
GEO.SET([2,N] Node, [2,M] RoadSection)

∧ SELECTION(BDC.Node.NodeType=‘small traffic circle’) BDC.Node
= trafficCircle(GEO.SET) ∧ (50 m < diameterTrafficCircle(GEO.SET) < 100 m)
GEO.SET([2,N] Node, [2,M] RoadSection)

∧ SELECTION(trafficCircle(BDC.SET))
BDC.SET([2,J] Node, [2,K] RoadSection WHERE (BDC.RoadSection.Vocation=‘ramp’))
= trafficCircle(GEO.SET) ∧ (diameterTrafficCircle(GEO.SET) > 100 m)
GEO.SET([2,N] Node, [2,M] RoadSection)

∧ SELECTION(BDC.Node.NodeType=‘simple crossroad’) BDC.Node
= SELECTION(GEO.Node.NodeType=‘traffic circle’) GEO.Node

This clause shows particular conflicts such as the fragmentation (‘SET’), the
different selection criteria (‘SELECTION’), and the decomposition conflicts (a
relationship that holds with a decomposition criterion). The ICA after the
enrichment can take the following, much simpler form:

BDC.SimpleTrafficCircle = GEO.SimpleTrafficCircle OR
SELECTION(diameterTrafficCircle(GEO.ComplexTrafficCircle) < 100 m)
GEO.ComplexTrafficCircle

BDC.ComplexTrafficCircle =
SELECTION(diameterTrafficCircle(GEO.ComplexTrafficCircle) > 100 m)
GEO.ComplexTrafficCircle

Intra-Database Control. At this level, the representations of the objects are
checked to detect some internal errors. For instance, we checked that no simple
traffic circle was in a cul-de-sac. The specifications of Georoute indicate that
such a node cannot take the ‘traffic circle’ value for the ‘nature’ attribute.
Concerning the complex objects, we systematically checked the diameter, the
number of nodes, and the direction of the cycle. The control was automated
thanks to several rules activated by the expert system. These rules were
developed and introduced by hand. For example:

(defrule control_diameter-georoute
    (if diameter < 30)
    (set diameterConformity "not conform"))

In some cases, several possible interpretations were assigned to the same
representation, since there was some uncertainty regarding its conformity. For
example, it is not possible to check the existence of a central reservation,
even though the presence of this object governs the selection of the traffic
circles. Some of these uncertainties were removed after the inter-database
control.

Spatial Data Matching. We developed specific tools to match our data. The
algorithms use Euclidean distance and intersection criteria: objects are matched
if they are close or if they intersect each other. An interactive check detected
only 8% of matching errors for a total of 124 matching pairs.

In order to increase the reliability of the matching phase and detect these
errors automatically, we decided to use the results of another matching
procedure. The algorithms used and developed by [21] rely on other criteria, in
particular the Hausdorff distance and the topological relationships. With these
results, we only retained the pairs that were identical in the two procedures,
that is to say, 82% of the matching pairs. We considered them as certain. In
general, the matching errors made by the two methods were not the same, and we
plan to improve the algorithms by exploiting the whole set of criteria (see
figure 8).
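
The two ingredients of this cross-check can be sketched in Python as follows.
The first function computes a discrete Hausdorff distance between two sets of
vertices, which only approximates the distance between the underlying lines
or polygons; the second keeps the pairs on which two matching procedures
agree. Neither is the implementation of [21]; the function names and the
sample data are assumptions made for illustration.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point of a to its nearest point of b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Discrete (vertex-based) Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

def agreed_pairs(pairs_1, pairs_2):
    """Matching pairs found by both procedures, considered as certain."""
    return set(pairs_1) & set(pairs_2)

# Hypothetical geometries given as lists of vertices.
line_a = [(0, 0), (1, 0)]
line_b = [(0, 0), (1, 0), (1, 2)]
distance = hausdorff(line_a, line_b)

# Hypothetical results of two matching procedures (BDCarto id, Georoute id).
certain = agreed_pairs({("r1", "g1"), ("r2", "g2")},
                       {("r1", "g1"), ("r2", "g3")})
```

In this example the pair on which the procedures disagree is left out of the
set of certain matches and would be examined separately.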




Fig. 8. Some results of the matching step computed with the two processes



Inter-database Control. Some internal errors were already detected during
the first step of control, but the representations of the two databases had not
yet been compared. This is the aim of this phase.

The introduction of rules by hand in the expert system was first considered,
but because of the numerous possible cases and the complexity of some rules, we
decided to use supervised machine learning. An example of a rule computed
by the C5.0 algorithm [35] is presented below. It enables the detection of an
inconsistency:

If the type of the traffic circle in Georoute = ‘dot’ 

And the node type of the traffic circle in BDCarto = ‘small traffic circle’ 

Then the representations are inconsistent. 
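
To show how such a rule could be applied to the matching pairs and summarised,
a short Python sketch follows. It encodes only the single rule quoted above;
the label "undecided" for pairs that the rule does not cover, the value
"complex" for the Georoute type, and the sample pairs are hypothetical and do
not report results of our experiment.

```python
from collections import Counter

def learned_rule(georoute_type, bdcarto_node_type):
    """Apply the quoted rule; pairs it does not cover stay undecided."""
    if georoute_type == "dot" and bdcarto_node_type == "small traffic circle":
        return "inconsistency"
    return "undecided"

# Hypothetical matching pairs: (Georoute type, BDCarto node type).
pairs = [
    ("dot", "small traffic circle"),
    ("dot", "simple crossroad"),
    ("complex", "small traffic circle"),
]

# Tally of the interpretations, in the spirit of a global evaluation.
summary = Counter(learned_rule(g, b) for g, b in pairs)
```

A complete rule base would add further rules so that the undecided pairs are
classified as equivalences, inconsistencies, or updates.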

This step, as well as the last steps, are work in progress. The comparison 
will lead to the classification of each matching pair in terms of equivalence and 
inconsistency. The rules gathered from the machine learning shall be analysed 
to evaluate their consistency according to the specifications. Finally, a global 
evaluation will be provided. 

5 Conclusion 

This paper has given an overview of the spatial database integration process
with its specificities, and has pointed to several key issues, among them the
assessment of consistency between multiple representations.

We have proposed a new approach to deal with that problem, considering
the specifications of the GDB as the principal knowledge to check the conformity
of each representation. Because of the complexity of this knowledge and,
sometimes, its inadequacy, we have decided to adopt an approach combining the
intervention of experts in the field and machine learning techniques to acquire
these specifications. The automatic activation of the learned rules is then
provided by an expert system.

The process has been tested in a real context, with two spatial databases of
the French National Mapping Agency. The application developed demonstrates the
feasibility of the approach. The last steps of the process are being
implemented. Further research has to focus on the interpretation of differences
in a context where no explicit capture constraints exist.



Abstract. Many geographical applications deal with spatial objects that cannot
be adequately described by determinate, crisp concepts because of their intrinsi- 
cally indeterminate and vague nature. Current geographical information systems 
and spatial database systems are unable to cope with this kind of data. To sup- 
port such data and applications, we introduce vague spatial data types for vague 
points, vague lines, and vague regions. These data types cover and extend previous
approaches and are part of a data model called VASA (Vague Spatial Algebra).
Their formal framework is based on already existing, general exact models of 
crisp spatial data types, which simplifies the definition of the vague spatial model. 
In addition, we obtain executable specifications for the operations which can be 
immediately used as implementations. This paper gives a formal definition of the 
three vague spatial data types as well as some basic operations and predicates. 
A few example queries illustrate the embedding and expressiveness of these new 
data types in query languages. 



1 Introduction 

The current mapping of spatial phenomena of the real world to exclusively crisp, i.e., 
precisely determined, spatial objects is an insufficient abstraction process for many geo- 
metric applications since often the feature of spatial vagueness or spatial indeterminacy 
is inherent to many geometric and geographic data [2]. Applications based on this kind 
of geometric data are so far not covered by current GIS and spatial database systems. 

So far, often contrary to reality, spatial data modeling implicitly assumes that the po- 
sitions of points, the locations and routes of lines, and the extent and hence the boundary 
of regions are precisely determined and universally recognized. The properties of the 
space at points, along lines, and within regions are given by attributes whose values are 
assumed to be constant over the whole objects. Examples are man-made spatial objects 
(e.g., monuments, highways, buildings) and predominantly immaterial spatial objects 
(e.g., countries, districts, land parcels with their political, administrative, and cadastral 
boundaries). We denote this kind of entities as crisp or determinate spatial objects. 

On the other hand, there are many geometric applications in which positions of points
are not exactly known, the locations and routes of lines are unclear, and regions do not
have sharp boundaries, or their boundaries cannot be precisely determined. Examples
are social or natural phenomena (e.g., terrorists’ refuges and escape routes, population
density, unemployment rate, soil quality, vegetation, oceans, oil fields, biotopes, deserts).
We denote this kind of entities as vague or indeterminate spatial objects.

This paper presents an object model for defining vague spatial data types for vague 
points, vague lines, and vague regions. These types are part of a data model called VASA 
(Vague Spatial Algebra). The model rests on “traditional” (i.e., exact) modeling tech- 
niques and extends, rather than replaces, the current theory of spatial database systems 
and GIS. Further, moving from an exact to a vague domain does not necessarily inval- 
idate conventional (computational) geometry; it is merely an extension. Hence, exact 
object models can be considered as special cases of our vague object model. All vague 
spatial data types and several vague spatial operations are defined generically, i.e., with- 
out type-specific definitions. Since our vague spatial data types and operations are based 
on their crisp counterparts and can be expressed by them, we obtain executable specifi- 
cations that can be directly used as an implementation. In this paper, we do not aim at 
developing a type system with a “complete” set of operations and predicates. The goal 
is more to demonstrate the power, simplicity, and expressiveness of our model. 

Section 2 discusses related work. Section 3 informally introduces the concept of 
vague spatial objects and motivates it by giving some application examples. Section 4 
gives a generic definition of vague spatial data types and vague spatial set operations. 
Section 5 deals with type-specific operations. Section 6 introduces some vague topo- 
logical predicates. Section 7 illustrates the embedding of vague spatial data types into 
query languages. Finally, Section 8 draws some conclusions and addresses future work. 



2 Related Work 

Spatial vagueness has to be seen in contrast to spatial uncertainty resulting from either 
a lack of knowledge about the position and shape of an object (positional uncertainty) 
or the inability of measuring such an object precisely (measurement uncertainty). Much 
literature, which we will not consider here, has been published on dealing with positional 
and measurement uncertainty; it mainly proposes probabilistic models. Spatial vague- 
ness is an intrinsic feature of a spatial object where we cannot be sure whether certain 
components belong to the spatial object or not. Our vague spatial data types cover both 
aspects of spatial uncertainty and spatial vagueness. 

Three main alternatives have been proposed as general design methods. Models based
on fuzzy sets (e.g., [1,11]) rely on fuzzy set theory and allow a much more fine-grained
modeling of vague spatial objects, but are computationally much more expensive
with respect to data structures and algorithms. Models based on rough sets (e.g., [13])
work with lower and upper approximations of spatial objects, which is similar to our
approach, but the formal background is rather different. Models based on exact spatial
objects (e.g., [3,4,10,6]) extend data models, type systems, and concepts for crisp spatial
objects to vague spatial objects. The model described in this paper belongs to this latter
category.

A benefit of the exact object model approach is that existing definitions, techniques,
data structures, algorithms, etc., need not be redeveloped but only modified and extended,
or simply used. So far, four object models have been proposed for vague regions. The first
three models use some kind of zone concept, either without holes [3,4] or with holes 
[10]. The central idea is to consider determined zones surrounding the indeterminate 
boundaries of a region and expressing its minimal and maximal extension. The zones 
serve as a description and separation of the space that certainly belongs to the region and 
the space that is certainly outside. While [3] and [4] are mainly interested in classifications 
of topological relationships between vague regions for which a simple model is assumed, 
[10] proposes a model of complex vague regions with vague holes and focuses on their 
formal definition. Unfortunately, the three approaches are limited to “concentric” object 
models and have problems with geometric closure properties. The model described in 
[6] also pursues the exact model approach but is much more general and much simpler 
than the other approaches. It is a precursor of this paper and introduces the concept of 
vague regions. 

3 What Are Vague Spatial Objects? 

The central idea of vague spatial objects is to base their definition on already well known, 
geometric modeling techniques. Our concept necessitates a general object model incor- 
porating determinate spatial data types point, line, and region that are closed under 
(appropriately defined) geometric union, intersection, difference, and complement oper- 
ations. Such crisp type systems have, e.g., been proposed in [9,7], and we will take them 
and their corresponding formal definition for granted in this paper. Informally, these 
models consider a point object as a finite set of individual points, a line object as a finite 
set of disjoint blocks where each block represents a finite set of curves, and a region 
object as a finite set of disjoint, connected areal components (called faces) possibly with 
disjoint holes (see Figure 1). 



Fig. 1. Examples of a (complex crisp) point object (a), a (complex crisp) line object (b), and a 
(complex crisp) region object (c). Each collection of components forms a single crisp object. 

As an illustrating example, we consider a homeland security scenario to introduce
our concept for dealing with spatial vagueness and to demonstrate its usability. Secret
services (should) have knowledge of the whereabouts of terrorists. For each terrorist,
some of their refuges are precisely known, while others are only conjectured. We can
model these locations as a vague point object where the precisely known locations are
called the kernel point object and the assumed locations are denoted as the conjecture
point object. Secret services are also interested in the routes a terrorist takes to move
from one refuge to another. These routes can be modeled as vague line objects. Some
routes, called kernel line objects, have been identified. Other routes can only be assumed
to be taken by a terrorist; they are denoted as conjecture line objects. Knowledge about
areas of terrorist activities is also important for secret services. For some areas it is
well known that a terrorist operates in them; we call them kernel region objects. For
other areas we can only assume that they are the target of terrorist activity; we denote
them as conjecture region objects. Figure 2 gives some examples. Grey shaded areas,
straight lines, and grey points indicate kernel parts; areas with white interiors, dashed
lines, and white points refer to conjecture parts.



Fig. 2. Examples of a (complex) vague point object (a), a (complex) vague line object (b), and a
(complex) vague region object (c). Each collection of components forms a single vague object.

Based on this scenario and taking into account spatial vagueness, we are able to pose 
interesting queries. We can ask for the locations where any two terrorists have taken 
the same refuge. We can determine those terrorists that operated in the same area. We 
can compute the locations where routes taken by different terrorists crossed each other. 
Many further queries are possible. Vague concepts offer a greater flexibility for modeling 
properties of spatial phenomena in the real world than determinate concepts do. Still, 
vague concepts comprise the modeling power of determinate concepts as a special case. 

In this sense, many scenarios can be found that could make meaningful use of the 
concept of vague spatial objects. They all have in common that a vague spatial object 
(e.g., a vague line) is described by a pair of two disjoint or adjacent crisp spatial objects 
(e.g., two crisp lines). The first crisp spatial object, called the kernel part, describes the 
determinate component of the vague object, that is, the component that definitely and 
always belongs to the vague object. The second crisp spatial object, called the conjecture 
part, describes the vague component of the vague object, that is, the component from 
which we cannot say with any certainty whether it or subparts of it belong to the vague 
object or not. Maybe the conjecture part or subparts of it belong to the vague object, 
maybe this is not the case. Or we could say that this is unknown. 

4 A Generic Definition of Vague Spatial Data Types and Vague 
Spatial Set Operations 

Based on the motivation in the previous section, we now give a formal definition of 
vague spatial data types (Section 4.1) and vague spatial set operations (Section 4.2). An 
interesting observation is that these definitions can be given in a generic manner, i.e., 
type-specific considerations are unnecessary. At the end, Section 4.3 introduces a few
other generic operations as well as predicates. 
4.1 Vague Spatial Data Types

For the definition of vague points, vague lines, and vague regions we make use of the
data types point for crisp points, line for crisp lines, and region for crisp regions. All
crisp spatial data types $\alpha \in \{\mathit{point}, \mathit{line}, \mathit{region}\}$ are assumed to have a complex inner
structure as it has been defined, e.g., on the basis of point sets and point set topology in
[7], or in concrete implementations in [8]. In particular, this means that a point object
includes a finite number of single points, a line object is assembled from a finite number
of curves, and a region object consists of a finite number of disjoint faces possibly
containing a finite number of disjoint holes. Further, these types must be closed under
the geometric set operations union ($\oplus : \alpha \times \alpha \to \alpha$), intersection ($\otimes : \alpha \times \alpha \to \alpha$),
difference ($\ominus : \alpha \times \alpha \to \alpha$), and complement ($\sim : \alpha \to \alpha$). Each type $\alpha$ together with the
operations $\oplus$ and $\otimes$ forms a boolean algebra. The identity of $\otimes$ is denoted by 1, which
corresponds to $\mathbb{R}^2$. The identity of $\oplus$ is represented by 0, which corresponds to the empty
spatial object (empty point set).

Syntactically, the extension of a crisp spatial data type to a corresponding vague type 
is given by a type constructor v as follows: 

$v(\alpha) = \alpha \times \alpha \quad \forall \alpha \in \{\mathit{point}, \mathit{line}, \mathit{region}\}$

That is, each vague spatial data type is represented as a pair of corresponding crisp spatial
data types. For example, for $\alpha = \mathit{point}$ we obtain $v(\mathit{point}) = \mathit{point} \times \mathit{point}$, which we
also name vpoint. Accordingly, the data types vline and vregion are defined. For a vague
spatial object $w = (k, c) \in v(\alpha)$, we call $k \in \alpha$ the kernel part of $w$, and $c \in \alpha$ denotes
the conjecture part of $w$.

Semantically, the kernel part represents the determinate, crisp part of w, i.e., the area 
which definitely and always belongs to w. The conjecture part describes the vague part 
of w, i.e., the area for which we cannot say with any certainty whether it or parts of 
it belong to w or not. Maybe it or parts of it belong to w, maybe this is not the case. 
We could also say that this is unknown or unclear and thus a conjecture. To enable this 
intended semantics, we require: 

$\forall \alpha \in \{\mathit{point}, \mathit{line}, \mathit{region}\}\ \forall w = (k, c) \in v(\alpha) : \mathit{disjoint}(k, c) \lor \mathit{meet}(k, c)$

The functions disjoint and meet, which operate on complex crisp spatial objects, denote 
generalized versions [12] of the well known topological predicates on simple spatial 
objects [5]. 
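The type constructor and its constraint can be sketched in a few lines of Python. This is only an illustration under strong assumptions: crisp objects are modelled as finite sets of grid cells (a hypothetical stand-in for the types point, line, and region), and the class name Vague is not taken from the paper. In such a finite model two distinct cells never share a boundary, so the condition "disjoint or meet" reduces to plain disjointness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vague:
    """A vague object as a (kernel, conjecture) pair of crisp objects.

    Assumption: a crisp object is a frozenset of grid cells (x, y).
    """
    kernel: frozenset
    conjecture: frozenset

    def __post_init__(self):
        # "disjoint or meet" collapses to disjointness in this finite model.
        if self.kernel & self.conjecture:
            raise ValueError("kernel and conjecture parts must not overlap")


# Known refuges form the kernel part, assumed refuges the conjecture part.
refuges = Vague(kernel=frozenset({(0, 0), (2, 1)}),
                conjecture=frozenset({(5, 5)}))
```

A pair whose parts overlap is rejected at construction time, which mirrors the requirement stated above.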

Let $\mathit{points} : v(\alpha) \to \mathbb{R}^2$ be an auxiliary function that yields the (unknown) point set
of a vague spatial object. For an object $w = (k, c) \in v(\alpha)$ we can then conclude that

$k \subseteq \mathit{points}(w) \subseteq k \oplus c$

Hence, $k$ can be regarded as a lower (minimal, guaranteed) approximation of $w$ and
$k \oplus c$ can be considered as an upper (maximally possible, speculative) approximation of
$w$, which brings us near to rough set theory. Even if we do not know the exact point set
of $w$, we assume and require that $\mathit{points}(w)$ is not arbitrary but compatible to $\alpha$, i.e.,

$\mathit{points}(w) \in \alpha$ and $\mathit{points}(w) \ominus k \in \alpha$

Using the characteristic function $\chi$ deciding about the existence or non-existence of an
element in a set, we obtain $\chi(p) = 1$ for all $p \in k$, $\chi(p) = 0$ for all $p \in \mathbb{R}^2 - (k \cup c)$,
$\chi(p) = 1 \lor \chi(p) = 0$ for all $p \in c - k$, and $\chi(p) = 1$ for all $p \in \mathit{points}(w) \in \alpha$. Note the
deliberate use of set-theoretic operations. Especially common boundary points of $k$ and
$c$ ($k \cap c \neq \emptyset$) are mapped to 1.

4.2 Vague Spatial Set Operations 

The three vague geometric set operations union, intersection, and difference all have the
same signature $v(\alpha) \times v(\alpha) \to v(\alpha)$. In addition, we define the operation complement
with the signature $v(\alpha) \to v(\alpha)$. It is our goal and makes sense to define them in a
type-independent and thus generic manner. In order to define them for two vague spatial 
objects u and w, it is helpful to consider meaningful relationships between the kernel 
part, the conjecture part, and the outside part of u and w. For each operation we give 
a table where a column/row labeled by k, c, or o denotes the kernel part, conjecture 
part, or outside part of u/w. Each entry of the table denotes a possible combination, i.e., 
intersection, of kernel parts, conjecture parts, and outside parts of both objects, and the 
label in each entry specifies whether the corresponding intersection belongs to the kernel 
part, conjecture part, or outside part of the operation’s result object. 



Table 1. Components resulting from intersecting kernel parts, conjecture parts, and outside parts 
of two vague spatial objects with each other for the four vague geometric set operations. 

The union (Table 1) of a kernel part with any other part is a kernel part since the union
of two vague spatial objects asks for membership in either object and since membership 
is already assured by the given kernel part. Likewise, the union of two conjecture parts 
or the union of a conjecture part with the outside should be a conjecture part, and only 
the parts which belong to the outside of both objects contribute to the outside of the 
union. 

The outside of the intersection (Table 1) is given by either object's outside because
intersection requires membership in both objects. The kernel part of the intersection
only contains components which definitely belong to the kernel parts of both objects, 
and intersections of conjecture parts with each other or with kernel parts make up the 
conjecture part of the intersection. 

Obviously, the complement (Table 1) of the kernel part should be the outside, and 
vice versa. With respect to the conjecture part, anything inside the vague part of an object 
might or might not belong to the object. Hence, we cannot definitely say that the com- 
plement of the vague part is the outside. Neither can we say that the complement belongs 
to the kernel part. Thus, the only reasonable conclusion is to define the complement of 
the conjecture part to be the conjecture part itself. 






The definition of difference (Table 1) between u and w can be derived from the 
definition of complement since it is equal to the intersection of u with the complement of 
w. That is, removing a kernel part means intersection with the outside, which always leads
to outside, and removing anything from the outside leaves the outside part unaffected. 
Similarly, removing a conjecture part means intersection with the conjecture part and 
thus results in a conjecture part for kernel parts and conjecture parts, and removing the 
outside of w (i.e., nothing) does not affect any part of u. 
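The four informal descriptions above can be condensed into a small lookup, sketched here in Python. The entries are derived solely from the prose (the body of Table 1 is not reproduced in this text), and the function names are hypothetical. Each function takes the labels of the parts of u and w (k, c, or o) and returns the label of the result part.

```python
K, C, O = "k", "c", "o"  # kernel, conjecture, outside


def union_part(a, b):
    # A kernel part wins; otherwise any conjecture part keeps the result vague.
    if K in (a, b):
        return K
    if C in (a, b):
        return C
    return O


def intersection_part(a, b):
    # Either outside forces outside; only kernel with kernel is definite.
    if O in (a, b):
        return O
    if a == K and b == K:
        return K
    return C


def complement_part(a):
    # Kernel and outside swap; the conjecture part stays a conjecture part.
    return {K: O, C: C, O: K}[a]


def difference_part(a, b):
    # Difference is intersection with the complement of the second operand.
    return intersection_part(a, complement_part(b))


for name, op in [("union", union_part), ("intersection", intersection_part),
                 ("difference", difference_part)]:
    print(name)
    for a in (K, C, O):
        print(" ", a, [op(a, b) for b in (K, C, O)])
```

Printing the three binary tables row by row gives a quick way to compare this reading of the prose with the published Table 1.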

Motivated by the just informally described, intended semantics for the four opera- 
tions, we now define them formally. An interesting aspect is that these definitions can 
be based solely on already known crisp geometric set operations on well-understood 
exact spatial objects. Hence, we are able to give executable specifications for the vague 
geometric set operations. This means, if we have the implementation of a crisp spatial 
algebra available, we can directly execute the vague geometric set operations without 
being forced to design and implement new algorithms for them. 

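Let $u, w \in v(\alpha)$, and let $u^k$ and $w^k$ denote their kernel parts, $u^c$ and $w^c$ their
conjecture parts, and $u^o$ and $w^o$ their outside parts. The outside of $u$, e.g., is defined as
$u^o := \sim(u^k \oplus u^c)$. We define:

$u \text{ union } w := (u^k \oplus w^k,\ (u^c \oplus w^c) \ominus (u^k \oplus w^k))$

$u \text{ intersection } w := (u^k \otimes w^k,\ (u^c \otimes w^c) \oplus (u^k \otimes w^c) \oplus (u^c \otimes w^k))$

$u \text{ difference } w := (u^k \otimes (\sim(w^k \oplus w^c)),\ (u^c \otimes w^c) \oplus (u^k \otimes w^c) \oplus (u^c \otimes (\sim w^k)))$

$\text{complement } u := (\sim(u^k \oplus u^c),\ u^c)$

We introduce juxtaposition as an abbreviating notation for the intersection of two
crisp spatial objects and assign intersection higher precedence than union and
difference. Hence, the above definition for u difference w could also be specified more
concisely as

$u \text{ difference } w := (u^k(\sim(w^k \oplus w^c)),\ u^c w^c \oplus u^k w^c \oplus u^c(\sim w^k))$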

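In a next step, we have to prove that the definitions realize the behavior specified in
Table 1. For $z = u \text{ union } w$ we have to show the three identities
(1) $z^k = u^k w^k \oplus u^k w^c \oplus u^k w^o \oplus u^c w^k \oplus u^o w^k$,
(2) $z^c = u^c w^c \oplus u^c w^o \oplus u^o w^c$, and (3) $z^o = u^o w^o$. The proof for
(1) leverages that $\oplus$ is idempotent. We can therefore duplicate the first term $u^k w^k$. Then
using the fact that $\otimes$ distributes over $\oplus$ we can factorize both $u^k$ and $w^k$ and obtain:
$z^k = (u^k \otimes (w^k \oplus w^c \oplus w^o)) \oplus (w^k \otimes (u^k \oplus u^c \oplus u^o))$. Since
$w^k \oplus w^c \oplus w^o = 1 = \mathbb{R}^2$ and $u^k \oplus u^c \oplus u^o = 1$, where 1 is the identity of $\otimes$, we get
$z^k = (u^k \otimes 1) \oplus (w^k \otimes 1) = u^k \oplus w^k$, which is the definition of the kernel part of union.

For proving equation (2) we know that for $r, s \in \alpha$ holds: $r \oplus s = rs \oplus r(\sim s) \oplus (\sim r)s$.
We can use this identity to rewrite the conjecture part definition as
$(u^c w^c \oplus u^c(\sim w^c) \oplus (\sim u^c)w^c) \ominus (u^k w^k \oplus u^k(\sim w^k) \oplus (\sim u^k)w^k)$.
Now we evaluate all complements by using that $\sim w^c = w^k \oplus w^o$ and $\sim w^k = w^c \oplus w^o$
(and analogously for $u$). This leads to
$(u^c w^c \oplus u^c(w^k \oplus w^o) \oplus (u^k \oplus u^o)w^c) \ominus (u^k w^k \oplus u^k(w^c \oplus w^o) \oplus (u^c \oplus u^o)w^k)$.
Applying distributivity of $\otimes$ we obtain:
$(u^c w^c \oplus u^c w^k \oplus u^c w^o \oplus u^k w^c \oplus u^o w^c) \ominus (u^k w^k \oplus u^k w^c \oplus u^k w^o \oplus u^c w^k \oplus u^o w^k)$.
In this term, only $u^c w^k$ and $u^k w^c$ appear in both parts of the difference; all other
intersections to be subtracted have no effect at all since all intersections are pairwise
disjoint. We obtain: $u^c w^c \oplus u^c w^o \oplus u^o w^c$, which corresponds exactly to the condition
required for $z^c$.

For proving equation (3) we note that in a boolean lattice for $r, s \in \alpha$ holds: $1 \otimes s = s$,
$1 \oplus s = 1$, and $1 = r \oplus (\sim r)$. Therefore, we know that $s = (r \oplus (\sim r))s = rs \oplus (\sim r)s$,
and it follows that $r \oplus s = r \oplus rs \oplus (\sim r)s$. We also know that $r \oplus rs = r(1 \oplus s) = r$ so
that $r \oplus s = r \oplus (\sim r)s$. Since $(\sim r)s$ is another way of denoting the difference $s \ominus r$,
we get: $r \oplus (s \ominus r) = r \oplus s$. Now we have by definition that
$z^o = \sim(z^k \oplus z^c) = \sim((u^k \oplus w^k) \oplus ((u^c \oplus w^c) \ominus (u^k \oplus w^k))) = \sim(u^k \oplus w^k \oplus u^c \oplus w^c)$.
By commutativity and de Morgan's law this reduces to
$(\sim(u^k \oplus u^c)) \otimes (\sim(w^k \oplus w^c))$, which is by the definition
of complement equal to $u^o \otimes w^o$, the condition required for $z^o$.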

Because of their length, we omit the proofs for the other three operations here; their
correct behavior can be shown in a similar way.
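Since the definitions are executable specifications, the three union identities can also be checked mechanically on a small model. The following Python sketch assumes that crisp objects are subsets of a tiny finite universe (a hypothetical stand-in for the plane) and that Python set operators play the roles of the crisp union, intersection, and difference. It is a brute-force sanity check on that model, not a replacement for the proof.

```python
from itertools import product

UNIVERSE = frozenset(range(4))  # hypothetical tiny stand-in for the plane


def outside(v):
    k, c = v
    return UNIVERSE - (k | c)


def union(u, w):
    (uk, uc), (wk, wc) = u, w
    return (uk | wk, (uc | wc) - (uk | wk))


def intersection(u, w):
    (uk, uc), (wk, wc) = u, w
    return (uk & wk, (uc & wc) | (uk & wc) | (uc & wk))


def complement(u):
    return (outside(u), u[1])


def difference(u, w):
    # As stated in the text: intersection of u with the complement of w.
    return intersection(u, complement(w))


def all_vague_objects():
    """Enumerate every (kernel, conjecture) pair with disjoint parts."""
    cells = sorted(UNIVERSE)
    for labels in product("kco", repeat=len(cells)):
        k = frozenset(x for x, lab in zip(cells, labels) if lab == "k")
        c = frozenset(x for x, lab in zip(cells, labels) if lab == "c")
        yield (k, c)


def check_union_identities():
    objects = list(all_vague_objects())
    for u, w in product(objects, repeat=2):
        (uk, uc), (wk, wc) = u, w
        uo, wo = outside(u), outside(w)
        zk, zc = union(u, w)
        assert zk == (uk & wk) | (uk & wc) | (uk & wo) | (uc & wk) | (uo & wk)
        assert zc == (uc & wc) | (uc & wo) | (uo & wc)
        assert outside((zk, zc)) == uo & wo
    return True
```

Calling check_union_identities() runs through all 81 x 81 pairs of vague objects over the four-cell universe; the same pattern could be used for the three operations whose proofs are omitted.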

4.3 Other Generic Operations and Predicates 

Sometimes it is helpful to be able to explicitly deal with the kernel part or the conjecture 
part of a vague spatial object, or to swap its kernel part and conjecture part. For that 
purpose, we define the following generic operations for $u = (u^k, u^c) \in v(\alpha)$:

$\mathit{kernel}(u) := (u^k, 0)$

$\mathit{conjecture}(u) := (0, u^c)$

$\mathit{invert}(u) := (u^c, u^k)$

All three operations have the signature $v(\alpha) \to v(\alpha)$*. The kernel operation
especially facilitates computations exclusively with the exact part of a vague spatial object
u, because the vague spatial operations, applied to vague spatial objects with an empty 
conjecture part, behave exactly like the corresponding crisp spatial operations. This can 
be easily seen from the definitions and is intended. Consequently, crisp spatial objects 
are a special case of their corresponding vague counterparts. The conjecture operation 
allows one to focus on the unclear, indeterminate part of u. The invert operation changes 
the role of kernel part and conjecture part of u. 

It is also possible to identify generic predicates. The most obvious ones are
$=, \neq\ : v(\alpha) \times v(\alpha) \to \mathit{bool}$. For $u, w \in v(\alpha)$, they are defined as follows:

$u = w := (u^k = w^k \land u^c = w^c)$

$u \neq w := (u^k \neq w^k \lor u^c \neq w^c)$
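A minimal Python sketch of these generic operations, assuming (hypothetically) that a vague object is a tuple of two frozensets standing in for its crisp kernel and conjecture parts. Under that assumption the predicates = and ≠ coincide with ordinary tuple comparison.

```python
EMPTY = frozenset()  # the empty crisp object, written 0 in the text


def kernel(u):
    """Keep the kernel part and drop the conjecture part."""
    return (u[0], EMPTY)


def conjecture(u):
    """Keep the conjecture part and drop the kernel part."""
    return (EMPTY, u[1])


def invert(u):
    """Swap the roles of kernel part and conjecture part."""
    return (u[1], u[0])


u = (frozenset({1, 2}), frozenset({3}))
print(kernel(u), conjecture(u), invert(u))
print(u == invert(invert(u)), u != invert(u))
```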

5 Type-Specific Spatial Operations 

In this section we describe a few operations requiring particular types as operands.
The operation boundary with the signature $\mathit{vregion} \to \mathit{vline}$ allows us to extract the
boundary of a vague region as a vague line. Its definition requires the crisp operation
$\mathit{boundary} : \mathit{region} \to \mathit{line}$ which determines the boundary of a crisp region as a crisp line.
Vice versa, the operation interior with the signature $\mathit{vline} \to \mathit{vregion}$ determines faces in a
vague line object and transforms them into a vague region. Its definition requires the crisp
operation $\mathit{interior} : \mathit{line} \to \mathit{region}$ which calculates the faces of a crisp line and collects
them into a crisp region. For $r \in \mathit{vregion}$ in particular $\mathit{interior}(\mathit{boundary}(r)) = r$ holds,
i.e., the operations interior and boundary are inverse. The operation vertices with the
signatures $\mathit{vline} \to \mathit{vpoint}$ and $\mathit{vregion} \to \mathit{vpoint}$ collects the end points of the segments
of a vague line and the end points of the segments of the boundary of a vague region,
respectively. Its definition is based on the crisp operation vertices with the signatures
$\mathit{line} \to \mathit{point}$ and $\mathit{region} \to \mathit{point}$. Let further $l \in \mathit{vline}$. We define:

* To really connect vague and crisp spatial data types and to map the kernel part or the
conjecture part of a vague spatial object to the corresponding crisp spatial object, one could define
projection functions $\pi_k, \pi_c : v(\alpha) \to \alpha$ with $\pi_k(u) = u^k$ and $\pi_c(u) = u^c$.



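$\mathit{boundary}(r) := (\mathit{boundary}(r^k),\ \mathit{boundary}(r^c))$

$\mathit{interior}(l) := (\mathit{interior}(l^k),\ \mathit{interior}(l^c))$

$\mathit{vertices}(l) := (\mathit{vertices}(l^k),\ \mathit{vertices}(l^c))$

$\mathit{vertices}(r) := (\mathit{vertices}(r^k),\ \mathit{vertices}(r^c))$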



We have so far described the intersection operation only for two vague spatial objects
of the same type. We extend this definition now to all mixed type combinations with the
signatures $\mathit{vpoint} \times \mathit{vline} \to \mathit{vpoint}$, $\mathit{vpoint} \times \mathit{vregion} \to \mathit{vpoint}$, $\mathit{vline} \times \mathit{vregion} \to \mathit{vline}$,
$\mathit{vregion} \times \mathit{vpoint} \to \mathit{vpoint}$, $\mathit{vregion} \times \mathit{vline} \to \mathit{vline}$, and $\mathit{vline} \times \mathit{vpoint} \to \mathit{vpoint}$. For
this purpose, we generalize the crisp intersection operation $\otimes$ to all corresponding crisp
variants of the just mentioned signatures. These crisp variants are well defined. The
already known definition for intersection can then also be applied in case of the mixed
type combinations.

The operation common_border incorporates a special kind of intersection. It computes
the shared boundary of two extended vague spatial objects as a vague line. Its
signatures are $\mathit{vline} \times \mathit{vline} \to \mathit{vline}$, $\mathit{vline} \times \mathit{vregion} \to \mathit{vline}$, $\mathit{vregion} \times \mathit{vline} \to \mathit{vline}$,
and $\mathit{vregion} \times \mathit{vregion} \to \mathit{vline}$. The definitions can be reduced to the well defined crisp
spatial operation $\mathit{common\_border} : \mathit{line} \times \mathit{line} \to \mathit{line}$ which computes the common line
parts of two crisp lines. We define for $l, m \in \mathit{vline}$ and $r, s \in \mathit{vregion}$:

$\mathit{common\_border}(l, m) := (\mathit{common\_border}(l^k, m^k),\ \mathit{common\_border}(l^c, m^c))$

$\mathit{common\_border}(l, r) := \mathit{common\_border}(l, \mathit{boundary}(r))$

$\mathit{common\_border}(r, l) := \mathit{common\_border}(l, r)$

$\mathit{common\_border}(r, s) := \mathit{common\_border}(\mathit{boundary}(r), \mathit{boundary}(s))$



Finally, we discuss the vague operator $\mathit{convex\_hull} : \mathit{vpoint} \to \mathit{vregion}$. A subset
$S$ of the plane is called convex if and only if for any pair of points $p, q \in S$ the line
segment between $p$ and $q$ is completely contained in $S$. The well known crisp operation
$\mathit{convex\_hull} : \mathit{point} \to \mathit{region}$ computes the smallest convex region that contains all points
of a given point object. We define the convex hull of a vague point $p \in \mathit{vpoint}$ as

$\mathit{convex\_hull}(p) := (\mathit{convex\_hull}(p^k),\ \mathit{convex\_hull}(p^k \oplus p^c) \ominus \mathit{convex\_hull}(p^k))$

The smallest, guaranteed convex hull of $p$ is given by the convex hull of its kernel
part. If all points of the conjecture part of $p$ lie inside or on the boundary of the convex
hull of its kernel part, the conjecture part of the resulting vague region is 0, i.e., the empty
region object. Otherwise, the convex hull involving all points both from the kernel part
and the conjecture part will be larger than the convex hull of the kernel part.
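The emptiness condition for the conjecture part can be illustrated with a short Python sketch. It uses Andrew's monotone chain algorithm as a stand-in for the crisp convex_hull operation (the paper does not prescribe an algorithm), represents hulls by their vertex lists rather than as region objects, and assumes the kernel part contains at least three non-collinear points. The function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices counter-clockwise.

    Collinear points on hull edges are dropped, so equal vertex lists
    mean equal hull regions.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def conjecture_hull_is_empty(kernel_pts, conjecture_pts):
    """True if adding the conjecture points does not enlarge the kernel hull."""
    both = list(kernel_pts) + list(conjecture_pts)
    return convex_hull(kernel_pts) == convex_hull(both)


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(conjecture_hull_is_empty(square, [(2, 2), (2, 0)]))  # inside / on the border
print(conjecture_hull_is_empty(square, [(6, 2)]))          # outside the kernel hull
```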
6 Vague Topological Predicates

Topological predicates provide information about the relative position of spatial objects
towards each other. The result type of vague topological predicates† is a value of a new
vague data type named $\mathit{vbool} = \{\mathit{true}, \mathit{false}, \mathit{maybe}\}$. That is, we use a three-valued logic
as the range of these predicates. The definition of the vague logical operators and, or,
and not (Table 2) parallels the definition of the vague spatial operations in Section 4.2
(t, f, and m are used as abbreviations for true, false, and maybe).



Table 2. Vague logical operators (three-valued logic). 
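The body of Table 2 is not reproduced in this text. A plausible reading, sketched below in Python, is the standard strong Kleene logic, which parallels the spatial operations if true is matched with the kernel part, maybe with the conjecture part, and false with the outside. This correspondence is an assumption drawn from the sentence above, and the function names are hypothetical.

```python
T, M, F = "true", "maybe", "false"
_RANK = {F: 0, M: 1, T: 2}  # false < maybe < true


def v_and(a, b):
    """Parallels intersection: the weaker of the two values."""
    return min(a, b, key=_RANK.get)


def v_or(a, b):
    """Parallels union: the stronger of the two values."""
    return max(a, b, key=_RANK.get)


def v_not(a):
    """Parallels complement: true and false swap, maybe stays maybe."""
    return {T: F, M: M, F: T}[a]


for a in (T, M, F):
    print(a, [v_and(a, b) for b in (T, M, F)], [v_or(a, b) for b in (T, M, F)], v_not(a))
```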

We first consider a generic definition of the vague inside predicate which has the
signatures $\mathit{vpoint} \times \alpha \to \mathit{vbool}$ with $\alpha \in \{\mathit{vpoint}, \mathit{vline}, \mathit{vregion}\}$, $\mathit{vline} \times \beta \to \mathit{vbool}$ with
$\beta \in \{\mathit{vline}, \mathit{vregion}\}$, and $\mathit{vregion} \times \mathit{vregion} \to \mathit{vbool}$. Let $u$ be the first operand and $w$
be the second operand according to the signatures. Then their definition is as follows:

$u \text{ inside } w := \begin{cases} \mathit{true} & \text{if } u^k \oplus u^c \subseteq w^k \\ \mathit{false} & \text{if } u^k \not\subseteq w^k \oplus w^c \\ \mathit{maybe} & \text{otherwise} \end{cases}$

Hence, we can safely say that u inside w holds if everything of u (i.e., kernel part and
conjecture part) is inside the kernel part of w. If this is not the case, we cannot simply
conclude that u inside w is false since this requires definite knowledge about a part of
u being outside any part of w. In other words, if we can exclude true as the predicate
result and if $u^k \subseteq w^k \oplus w^c$, we are not sure about insideness, and we define u inside w
as maybe.

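Next we present a generic definition of the vague intersects predicate which has the
signature $\alpha \times \alpha \to \mathit{vbool}$ with $\alpha \in \{\mathit{vpoint}, \mathit{vline}, \mathit{vregion}\}$. We define for $u, w \in \alpha$:

$u \text{ intersects } w := \begin{cases} \mathit{true} & \text{if } u^k w^k \neq 0 \\ \mathit{false} & \text{if } (u^k \oplus u^c)(w^k \oplus w^c) = 0 \\ \mathit{maybe} & \text{otherwise} \end{cases}$

The definition means that the predicate holds if the kernel parts of u and w intersect.
This is true independent from the value of $u^c w^c$. Likewise, if $u^k \oplus u^c$ and $w^k \oplus w^c$ are
disjoint, we can definitely say that u and w do not intersect at all. However, if $u^k w^k = 0$
and $(u^k \oplus u^c)(w^k \oplus w^c) \neq 0$, we cannot be sure about the intersection of u and w and let the predicate
return the value maybe.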
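The prose explanations of inside and intersects translate directly into code. The Python sketch below follows those explanations, under the assumption that a vague object is a (kernel, conjecture) tuple of frozensets and that subset and intersection tests on sets stand in for the corresponding crisp operations; the function names and sample data are hypothetical.

```python
def inside(u, w):
    (uk, uc), (wk, wc) = u, w
    if (uk | uc) <= wk:          # everything of u lies in the kernel of w
        return "true"
    if not uk <= (wk | wc):      # some definite part of u is outside all of w
        return "false"
    return "maybe"


def intersects(u, w):
    (uk, uc), (wk, wc) = u, w
    if uk & wk:                          # the kernel parts share points
        return "true"
    if not ((uk | uc) & (wk | wc)):      # not even the conjectures touch
        return "false"
    return "maybe"


fs = frozenset
print(inside((fs({1}), fs({2})), (fs({1, 2, 3}), fs({4}))))
print(inside((fs({1}), fs({5})), (fs({1}), fs({2}))))
print(intersects((fs({1}), fs({2})), (fs({2}), fs({3}))))
```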

† In this paper, we deliberately omit the discussion of a “complete” collection of vague topological
predicates for which we are currently designing a comprehensive concept that will be presented 
in a future paper. 
The predicate $\mathit{on\_border\_of} : \mathit{vpoint} \times \alpha \to \mathit{vbool}$ with $\alpha \in \{\mathit{vline}, \mathit{vregion}\}$ checks
whether a vague point is located on a vague line or on the boundary of a vague region, respectively. Let
$p \in \mathit{vpoint}$, $l \in \mathit{vline}$, and $r \in \mathit{vregion}$. We use the operation boundary defined in Section 5
to compute the boundary of a vague region as a vague line and define:

$\mathit{on\_border\_of}(p, l) := \mathit{inside}(p, l)$

$\mathit{on\_border\_of}(p, r) := \mathit{inside}(p, \mathit{boundary}(r))$

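The predicate border_in_common has the signatures $\mathit{vline} \times \mathit{vline} \to \mathit{vbool}$,
$\mathit{vline} \times \mathit{vregion} \to \mathit{vbool}$, $\mathit{vregion} \times \mathit{vline} \to \mathit{vbool}$, and $\mathit{vregion} \times \mathit{vregion} \to \mathit{vbool}$. It determines
whether two extended vague spatial objects share a common border. Let $l, m \in \mathit{vline}$ and
$r, s \in \mathit{vregion}$. Then

$\mathit{border\_in\_common}(l, m) := \begin{cases} \mathit{true} & \text{if } l^k m^k \neq 0 \land l^k m^k \in \mathit{vline} \\ \mathit{false} & \text{if } l^k m^k \oplus l^c m^k \oplus l^k m^c \oplus l^c m^c = 0\ \lor\ l^k m^k \oplus l^c m^k \oplus l^k m^c \oplus l^c m^c \notin \mathit{vline} \\ \mathit{maybe} & \text{otherwise} \end{cases}$

$\mathit{border\_in\_common}(l, r) := \mathit{border\_in\_common}(l, \mathit{boundary}(r))$

$\mathit{border\_in\_common}(r, l) := \mathit{border\_in\_common}(l, r)$

$\mathit{border\_in\_common}(r, s) := \mathit{border\_in\_common}(\mathit{boundary}(r), \mathit{boundary}(s))$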



7 Embedding Vague Spatial Data Types into Query Languages 

A few examples shall demonstrate the integration of vague spatial concepts into the 
relational data model and the query language VSQL (vague SQL). We do not give a full 
description of VSQL. Vague spatial data types are embedded as abstract data types into 
a relational schema and may be used as attribute types like standard types. That is, their 
internal structure is hidden from the user and can only be accessed by operations. For 
instance, we may specify the relation weather(climate: string, region: vregion), where the 
column named region contains vague region values for various climatic conditions given 
by the column climate, and the relation soil(quality: string, region: vregion) describing 
the soil quality for certain regions. 

If we want to find out all regions where lack of water is a problem for cultivation, 
or if we are interested in bad soil regions as a hindrance for cultivation, we can pose the 
following queries: 

select region from weather where climate = “dry” 

select region from soil where quality = “bad” 

Note that the result of both queries is a set of vague regions. 

If we want to find out about vague regions where cultivation is impossible due to a 
lack of water or bad soil quality, we ask for the union of two vague region sets. Thus, 
we first have to cast the sets into single vregion objects. We therefore use the built-in, 
overloaded aggregation function sum which, when applied to a set of vague regions, 
aggregates this set by repeated application of union. We can determine regions where 
cultivation is impossible by: 

(select sum(region) from weather where climate = “dry”)

union 

(select sum(region) from soil where quality = “bad”) 
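The behavior of the aggregation function sum, i.e., repeated application of union, can be sketched with a fold. The Python fragment below is an illustration only: regions are modelled as (kernel, conjecture) tuples of frozensets of cell identifiers, the names vsum and EMPTY are hypothetical, and the sample rows are made up.

```python
from functools import reduce

EMPTY = (frozenset(), frozenset())  # vague object with empty kernel and conjecture


def union(u, w):
    (uk, uc), (wk, wc) = u, w
    kernel = uk | wk
    return (kernel, (uc | wc) - kernel)


def vsum(regions):
    """Aggregate a set of vague regions into one by repeated union."""
    return reduce(union, regions, EMPTY)


# Made-up rows standing in for the query results on weather and soil.
dry = [(frozenset({1, 2}), frozenset({3})), (frozenset({3}), frozenset({4}))]
bad = [(frozenset({7}), frozenset({2, 8}))]

impossible = union(vsum(dry), vsum(bad))
print(impossible)
```

Cell 3 is only conjectured in the first dry region but is a kernel cell of the second, so it ends up in the kernel part of the aggregate; likewise cell 2 moves from the conjecture part of the soil region into the kernel part of the final union.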



As a next example, we take the self-explaining relations pollution(type: string, region:
vregion) and areas(use: string, region: vregion). Pollution is nowadays a central
ecological problem and causes an increasing amount of environmental damage. Important
examples are air pollution and oil pollution. Pollution control institutions, ecological
researchers, and geographers usually use maps for visualizing the expansion of pollution.
We can ask, for example, for inhabited areas which are air polluted (where the kernel
part of an air pollution denotes heavily polluted areas and the conjecture part only gives
slightly polluted regions) as follows:

select sum(pollution.region) intersection sum(areas.region)

from pollution, areas

where areas.use = “inhabited” and pollution.type = “air”

Then the kernel part of the result consists of inhabited regions which are heavily polluted, 
and the conjecture part consists (a) of slightly polluted inhabited regions, (b) of heavily 
polluted regions which are only partially inhabited, and (c) of slightly polluted and 
partially inhabited regions. 

If we want to reach all people who live in heavily polluted areas, we need the kernel
part of the intersection together with conjecture part (b) of the intersection. How can
we get this from the above query? The trick is to force conjecture parts (a) and (c) to be
empty by restricting pollution areas to their kernel region:

select kernel(sum(pollution.region)) intersection sum(areas.region) 

from . . . 

A slightly different query is to find out all areas where people are definitely or possibly
endangered by pollution. Of course, we have to use an intersection predicate. More
precisely, we want to find those areas for which intersects yields either true or maybe.
For this purpose we can prefix any predicate with maybe, which causes the predicate
to fail only if it returns false. Technically, maybe turns a maybe value into true. So the
query is:

select areas.name

from pollution, areas

where areas.use = “inhabited” and pollution.region maybe intersects areas.region

This query is an example of a vague spatial join. 

The following example is based on the self-explaining relations resources(kind:
string, region: vregion) and nature(type: string, region: vregion) and describes a situation
which stresses the conflicting interests of economy and ecology. Assume on the one hand
areas of animal species and plants that are worth being protected (nature reserves and
national parks are the kernel parts) and on the other hand mineral resources whose mining
promises high profits. An example for forming a difference of vague regions
is a query which asks for mining areas that do not affect the living space of endangered
species.

(select sum(region) from resources where kind = “mineral”) 

difference 

(select sum(region) from nature where type = “endangered”) 

The kernel part of the result describes regions where mining should be allowed. The 
conjecture part consists (a) of regions where mineral resources are uncertain and (b) of 
resource kernels that lie in (non-kernel) regions hosting endangered species. Since na- 
tional parks are generally protected by the government, it is especially regions (b) con- 
servationists should carefully observe. We can determine these regions by: 

(select kernel(sum(region)) from resources where kind = “mineral”) 

intersection 

(select conjecture(sum(region)) from nature where type = “endangered”) 

The result is a vague region with an empty kernel part and a conjecture part that just 
consists of the intersection of the mineral resource kernel part and the endangered nature 
conjecture part. 

Next, we consider an example from biology and assume living spaces of different 
animal species stored in a relation animals(name: string, region: vregion). The kernel 
part describes places where they normally live, and the conjecture part describes regions 
where they can be found occasionally (e.g., to hunt for food or to migrate from one kernel
area to another through a corridor). We can search for pairs of species which share a 
common living space. This query asks for regions which have a non-empty intersection 
kernel: 

select A.name, B.name

from animals as A, animals as B

where A.region intersects B.region



8 Conclusions and Future Work 

We have defined a data model of points, lines, and regions that is capable of describing 
many different aspects of spatial vagueness. It is a canonical extension of determinate 
spatial data models, which facilitates the treatment of vague and exact objects in one 
model. Since our approach is based on exact spatial modeling concepts, it allows us to
build upon existing work and simplifies many definitions. In particular, we can leverage
already existing implementations of crisp spatial type systems to realize vague spatial 
objects with only minimal effort by executable specifications. 

Currently we are working on a comprehensive concept for vague topological predi- 
cates and also deal with vague numerical operations. Implementation tuning is another 
topic. 
Abstract. This paper describes a framework for a highly distributed,
real-time monitoring approach to database security using Intelligent
Multi-Agents. The intrusion prevention system described in this paper
uses a combination of statistical anomaly prevention and rule-based
misuse prevention in order to detect a misuser. The statistical
anomaly prediction system employs an ensemble Quickprop neural
network forecasting model, which predicts unauthorized invasions by users
based on previous observations and takes further action before an intrusion
occurs. The experimental study is performed using real data provided
by a major corporate bank. A comparative evaluation of the two
ensemble networks against the individual networks was carried out using
the mean absolute percentage error on a prediction data set, and better
prediction accuracy was observed. The misuse prevention system
uses a set of rules that define typical illegal user behavior. A separate
rule subsystem is designed for this misuse detection system; it is
known as the Temporal Authorization Rule Markup Language (TARML).
In order to reduce single points of failure in a centralized security system,
a dynamic distributed system has been designed in which the security
management task is distributed across the network using Intelligent
Multi-Agents.

Keywords: Multi-Agents - Database Security - Quickprop Prediction 
Technique - Neural Networks. 



1 Motivation 

In information systems, the primary security threat comes from insider abuse 
and from intrusion. Security policies do not sufficiently guard data stored in a 
database system against “privileged users” [1]. Many intrusions into informa- 
tion systems manifest through the significantly increased or decreased intensity 
of transactions occurring in information systems. For example, intruders who 
have gained super-user privileges can perform malicious transactions and disable 
many resources in the information system, resulting in the abruptly decreased 
intensity of transactions [4]. This reinforces the point that intrusion detection 
systems should not only be employed at the network and hosts, but also at 
the database systems where the critical information assets lie [6]. Therefore, the
early detection of significant changes in data object usage can help stop
many intrusions early, protecting information systems and assuring their
reliability [2].

2 Related Works 

The existing intrusion detection systems [1] [6] operate in real time, capturing
the intruder when or after the intrusion occurs. From the existing methods of
detecting intrusions [1] [6], we observed that all intrusion detection systems
share a serious weakness: they take action only after the intrusion has been
detected [9]. Although such an intrusion detection system works in real time, it can
detect the intrusion after the action, but never before [2]. This weakness has led
to research on forecasting models. To address the problem of detecting
intrusions only after they take place, we utilize a Quickprop neural network (NN)
ensemble prediction algorithm, which takes into account user behavior and
generates a predicted profile to foresee future user actions.

One of the most difficult problems in neural network modeling is the selection of
a proper network structure. Usually a single network fails to capture all the
intricacy present in the data. An ensemble uses the outputs of many neural networks
to jointly solve a problem and at the same time improves the generalization ability
of the network significantly [3]. Hence, the prediction error of a combined network
(ensemble) is lower than that of the individual networks. Analysis was performed
for different network architectures by varying the number of hidden layers, the number
of hidden neurons, and the types of activation functions.
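The idea of combining network outputs and scoring them with the mean absolute percentage error (MAPE) can be sketched as follows. This is an illustration under stated assumptions: the text does not say how the ensemble combines its member outputs, so a plain average is used here, the function names are hypothetical, and all numbers are made up rather than taken from the bank data set.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def ensemble_average(member_predictions):
    """Combine member forecasts by averaging them position by position."""
    return [sum(column) / len(column) for column in zip(*member_predictions)]


# Made-up transaction counts and forecasts from two hypothetical networks.
actual = [100.0, 200.0, 400.0]
net_a = [110.0, 180.0, 420.0]
net_b = [90.0, 230.0, 380.0]

combined = ensemble_average([net_a, net_b])
print(mape(actual, net_a), mape(actual, net_b), mape(actual, combined))
```

In this toy data the members err in opposite directions, so averaging cancels most of the error; with real forecasts the gain depends on how correlated the member errors are.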

Existing work on intrusion detection has focused largely on network [9]
and host intrusion [1]. Most of the research on database security revolves around
access policies, roles, administration procedures, physical security, security
models, and data inference. Little work has been done on database IDSs;
most of the emphasis in the literature is on network IDSs [6].
DIDAFIT [6] is a database intrusion detection system that identifies anomalous
database accesses by matching SQL statements with a known set of legitimate
database transaction fingerprints. In the work of Chung [1], a method was devised which
generates profiles of the users and their roles in a relational database system.
This method assumes that legitimate users show some level of consistency in
using the database system. If this assumption does not hold, or if the threshold
for inconsistency is not set properly, the result will be a high level of false
positives. It also faces the attribute selection problem, such as choosing a feature
in building a work scope. To the best of the authors' knowledge, there is no
report on an intrusion prevention system for databases, and this is the only
work using a neural network forecasting model and SQL
transaction rules to prevent database intrusions.

2.1 User Audit Profile 

Our Intrusion Detection system uses a hybrid detection technique. Thus, the user
profile is a collection of real-time negative authorization rules stated by
database administrators, together with the audit record. The rules cover access
to database objects of the network computer system for which permission is not
granted, data objects that users cannot use on their hosts, and even privileges
that the database administrators feel the users should not use.



2.2 Temporal Authorization Rule Markup Language 

In our model, a negative authorization is specified as (time, auth), where time
is a temporal attribute and auth(s, o, p) is an authorization. Here, time
represents either the valid time or the transaction time during which auth is
invalid, s represents the subject, o the database object and p the privilege.
These rules are represented by means of ETCA (Event-Time-Condition-Action) [5]
rules. Briefly, whenever the event takes place, the corresponding negative
authorization condition is checked, and if the condition is satisfied, the
defense action is performed on the user for whom the attack signature was
detected.
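
As a minimal sketch of the (time, auth) pair described above, the following
Python fragment models a negative authorization and checks whether a request
falls inside its forbidden interval. The names NegativeAuthorization and
is_denied, as well as the sample rule, are illustrative assumptions and are not
part of the rule language itself.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NegativeAuthorization:
    """A (time, auth) pair: auth = (s, o, p) is invalid during [start, end]."""
    start: datetime   # beginning of the interval in which auth is invalid
    end: datetime     # end of that interval
    subject: str      # s: the user or role
    obj: str          # o: the database object
    privilege: str    # p: the privilege being denied


def is_denied(rule, subject, obj, privilege, when):
    """Return True if the request matches the rule inside its time interval."""
    same_auth = (rule.subject, rule.obj, rule.privilege) == (subject, obj, privilege)
    return same_auth and rule.start <= when <= rule.end


# Hypothetical rule: tellers may not update loan accounts overnight.
rule = NegativeAuthorization(
    start=datetime(2004, 1, 5, 18, 0),
    end=datetime(2004, 1, 6, 8, 0),
    subject="Tlr", obj="CLAcc", privilege="update",
)
```

A request is rejected only when subject, object and privilege all match and the
request time lies inside the interval; any other request is left to the positive
authorization mechanism, which this sketch does not model.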

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Rules[
<!ELEMENT Rules (Rule)*>
<!ELEMENT Rule (Rulename, Event, Condition, Action)>
<!ELEMENT Rulename (#PCDATA)>
<!ELEMENT Event (Eventname, DataEvent, TemporalEvent, OtherEvent)>
<!ELEMENT Eventname (#PCDATA)>
<!ATTLIST Eventname Timestamp PCDATA #REQUIRED>
<!ELEMENT DataEvent (Create, Update, Delete, View)>
<!ELEMENT Create (Node)>
<!ELEMENT Update (Node)>
<!ELEMENT Delete (Node)>
<!ELEMENT View (Node)>
<!ELEMENT Node (#PCDATA)>
<!ELEMENT TemporalEvent (At, After, Before, Between, Every)>
<!ELEMENT At (Time)>
<!ELEMENT After (Time)>
<!ELEMENT Before (Time)>
<!ELEMENT Between (Time)>
<!ELEMENT Every (Time)>
<!ELEMENT OtherEvent (Data_too_Large, Division_By_Zero)>
<!ELEMENT Condition (User, Domain, Role, Priority, Context, XQUERY_stmt)>
<!ELEMENT Rule (User)+>
<!ELEMENT User (Domain)+>
<!ELEMENT Domain (Role)+>
<!ELEMENT Role (Object)+>
<!ELEMENT User (EMPTY)>
<!ATTLIST User username CDATA #REQUIRED>
<!ELEMENT Domain (EMPTY)>
<!ATTLIST Domain domainname CDATA #REQUIRED>
<!ELEMENT Role (EMPTY)>
<!ATTLIST Role rolename CDATA #REQUIRED>
<!ELEMENT Priority (EMPTY)>
<!ATTLIST Priority pri_value CDATA #REQUIRED>
<!ELEMENT Context (EMPTY)>
<!ATTLIST Context cont_value CDATA #REQUIRED>
<!ELEMENT XQUERY_stmt (#PCDATA)>
<!ELEMENT Action (SOAP_Header, SOAP_Envelope, SOAP_Body)>
<!ELEMENT Soap_Header (#PCDATA)>
<!ELEMENT Soap_Envelope (#PCDATA)>
<!ELEMENT Soap_Body (Oper)>
<!ELEMENT oper ("INVALID ACTION")>
<!ATTLIST oper opername CDATA #REQUIRED>
]>

The event part of the rule specifies the event responsible for rule triggering,
which is enclosed in XML event tags. The time part of the rule defines the
temporal events. XQuery [13] along with XPath [12] is used to specify the
condition part of the rule. The action part of the Authorization Rule DTD is
encapsulated using the Simple Object Access Protocol (SOAP) [7]. A SOAP message
consists of three parts. The SOAP Envelope element is the root element of a SOAP
message. It defines the XML document as a SOAP message. The optional SOAP
Header element contains application-specific information (such as
authentication) about the SOAP message. If the Header element is present, it
must be the first child element of the Envelope element. The SOAP Body contains
the action data that is to be transmitted to the client.
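
The three-part message structure just described can be sketched with Python's
standard xml.etree.ElementTree module. This is only an illustration under stated
assumptions: the function name build_action_message, the AuthToken header entry
and the opername value are hypothetical, while the namespace URI is the one
defined for SOAP 1.1 envelopes.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def build_action_message(action_text, auth_token=None):
    """Wrap an action in a SOAP envelope; the optional Header comes first."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    if auth_token is not None:
        header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
        # 'AuthToken' is a hypothetical application-specific header entry.
        ET.SubElement(header, "AuthToken").text = auth_token
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # 'oper' mirrors the action element of the rule DTD; the attribute
    # value 'reject' is an assumed example.
    oper = ET.SubElement(body, "oper", {"opername": "reject"})
    oper.text = action_text
    return envelope


message = build_action_message("INVALID ACTION", auth_token="example-token")
xml_text = ET.tostring(message, encoding="unicode")
```

Because the Header is created before the Body, it is always the first child of
the Envelope when present, which is the ordering constraint stated above.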



2.3 Audit Record 

Auditing is the monitoring and recording of selected user database actions, and
it is used to investigate suspicious activity. There are three standard types
of auditing for monitoring user behavior: SQL statement-level, privilege-level
and object-level auditing. Statement and privilege audit options are in effect
at the time a database user connects to the database and remain in effect for
the duration of the session. In contrast, changes to object audit options become
effective for current sessions immediately [4]. So in this work, we have chosen
object-level auditing to build profiles. Object-level auditing can be done
per user on successful or unsuccessful attempts for session intervals. A session
is the time between when a user connects to and disconnects from a database
object. We need a utility to capture the submitted database transactions in
order to compare them with those in the legitimate user profile. Oracle provides
the sql_trace [6] utility, which can be used to trace all database operations in
a database session of a user. We make use of its capability to log SQL
transactions executed by the database engine.



Intelligent Multi-agent Based Database Hybrid Intrusion Prevention System 397 



The following attributes are included in each audit trail record: 

• User ID 

• Group ID 

• Process ID 

• Session ID 

• Host ID & IP address 

• Object ID 

• Event: It describes the type of transaction performed or attempted on a 
particular data object. 

• Completion: It describes the result of an attempted operation. A successful
operation returns the value zero, and an unsuccessful operation returns the
error code describing the reason for the failure.

• Transaction Time: The time at which the events or state changes are
registered in the computer system. The total value of the user session is
calculated based on this transaction time.

• Valid Time: It has two values, startTime and endTime, representing the
interval during which the tuple in the audit record is valid. The
prediction values generated by our ‘statistical prediction engine’, the
past observations and the on-line behavior of the user are stipulated using
the valid time attribute.

The following metrics are considered to audit the user behavior: 

• Audit the frequency of execution of certain commands
(Command Stroke Rate) by a user on an object in a session

• Audit Execution Denials or Access Violations on an object in a session

• Audit the Object utilization by a user for a certain period

• Audit the overt requests for a data object in a session 
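
To make the audit trail attributes and two of the metrics above concrete, the
following Python sketch defines a record type and two small aggregation
functions. The field names, their types and the function names are assumptions
chosen for illustration; only the list of attributes is taken from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AuditRecord:
    user_id: str
    group_id: str
    process_id: int
    session_id: str
    host_id: str
    ip_address: str
    object_id: str
    event: str               # type of transaction performed or attempted
    completion: int          # 0 on success, otherwise an error code
    transaction_time: float  # assumed unit: seconds since session start
    valid_start: float       # startTime of the valid-time interval
    valid_end: float         # endTime of the valid-time interval


def command_stroke_rate(records, user_id, object_id, session_id):
    """Count how often each command was issued by a user on an object in a session."""
    return Counter(
        r.event for r in records
        if (r.user_id, r.object_id, r.session_id) == (user_id, object_id, session_id)
    )


def access_violations(records, object_id, session_id):
    """Count failed operations (non-zero completion code) on an object in a session."""
    return sum(
        1 for r in records
        if r.object_id == object_id and r.session_id == session_id and r.completion != 0
    )
```

Keeping the completion code in every record lets the same trail serve both the
command frequency metric and the access violation metric.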



3 Prediction Algorithm 

The Quickprop NN ensemble forecasting model makes periodic short-term forecasts,
since long-term forecasts cannot accurately predict an intrusion [2]. Here we use
a multivariate time series technique to forecast the hacker’s behavior effectively.
The algorithm consists of two phases: determination of the number of neurons
in the hidden layer(s) and construction of a Quickprop forecaster. The determined
input patterns are then used to construct the Quickprop forecaster.

A rule of thumb, known as the Baum-Haussler rule, is used to determine the 
number of hidden neurons to be used: 

\[ N_{hidden} \le \frac{N_{train} \, E_{tolerance}}{N_{pts} + N_{output}} \]

where $N_{hidden}$ is the number of hidden neurons, $N_{train}$ is the number of
training examples, $E_{tolerance}$ is the error tolerance, $N_{pts}$ is the
number of data points per training example, and $N_{output}$ is the number of
output neurons.
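
The rule of thumb above translates directly into a small helper. The sketch
below is illustrative only: the function name is an assumption, the lower limit
of one neuron is a convenience added here, and the sample values are made up
rather than taken from the experiments.

```python
import math


def baum_haussler_hidden_neurons(n_train, e_tolerance, n_pts, n_output):
    """Upper bound on hidden neurons: N_train * E_tolerance / (N_pts + N_output)."""
    if n_pts + n_output <= 0:
        raise ValueError("n_pts + n_output must be positive")
    bound = n_train * e_tolerance / (n_pts + n_output)
    return max(1, math.floor(bound))  # assumed convention: keep at least one neuron


# Made-up example: 1000 training examples, tolerance 0.125, 8 inputs, 2 outputs.
n_hidden = baum_haussler_hidden_neurons(1000, 0.125, 8, 2)
```

With these sample values the bound is 1000 x 0.125 / (8 + 2) = 12.5, so at most
12 hidden neurons would be used.
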

A neural network ensemble is a learning paradigm in which several neural
networks are jointly used to solve the same problem. The networks with the
highest accuracy were considered as the ensemble members. The purpose of the
ensemble model is to reduce the variance, or instability, of the neural network.
It is a weighted average combination of the individual NN outputs, which finds a
weight for each individual network output in order to minimize the mean absolute
percentage error (MAPE) of the ensemble. The ensemble weights are determined
as a function of the relative error of each network determined in training. The
generalized ensemble output is defined by:

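\[ GEM = \sum_{i=1}^{n} w_i f_i(x) \qquad (1) \]

where the $w_i$'s ($\sum_{i=1}^{n} w_i = 1$) are chosen to minimize the MAPE with
respect to the target function (estimated using the prediction set). The optimal
weight $w_i$ is given by:

\[ w_i = \frac{\sum_{j=1}^{n} C^{-1}_{ij}}{\sum_{k=1}^{n} \sum_{j=1}^{n} C^{-1}_{kj}} \qquad (2) \]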

where $C_{ij} = E[e_i(x) e_j(x)]$ is the error correlation matrix,
$e_i(x) = f(x) - f_i(x)$ is the error of network $f_i$, and $f(x)$ is the true
data. In practice, errors are often highly correlated. Thus the rows of C are
nearly linearly dependent, so that inverting C can lead to serious round-off
errors. To avoid this, one can exclude the networks whose errors are highly
correlated with each other. The total error of neural network training reflects
partly the fitting to the regularities of the data and partly the fitting to the
noise in the data. Ensemble averaging tends to filter out the noise part, as it
varies amongst the ensemble members, and tends to retain the fitting to the
regularities of the data, thereby decreasing the overall error in the model.
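
The weight computation of the generalized ensemble method can be sketched in
plain Python as follows. This is an illustrative sketch, not the implementation
used in the experiments: the function names, the Gauss-Jordan inversion and the
singularity threshold of 1e-12 are assumptions, and the expected value in the
correlation matrix is estimated by a sample mean.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # assumed threshold
            raise ValueError("C is (nearly) singular; drop highly correlated networks")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def error_correlation(errors):
    """C[i][j] = sample mean of e_i(x) * e_j(x); errors[i] lists e_i per sample."""
    n, m = len(errors), len(errors[0])
    return [[sum(errors[i][s] * errors[j][s] for s in range(m)) / m
             for j in range(n)] for i in range(n)]


def gem_weights(errors):
    """w_i = sum_j Cinv[i][j] / sum_k sum_j Cinv[k][j]; the weights sum to one."""
    c_inv = invert(error_correlation(errors))
    total = sum(sum(row) for row in c_inv)
    return [sum(row) / total for row in c_inv]


def gem_output(weights, predictions):
    """Weighted average of the individual network outputs for one input."""
    return sum(w * f for w, f in zip(weights, predictions))
```

For two networks with uncorrelated errors of mean square 1 and 4, the sketch
assigns weights 0.8 and 0.2, so the more accurate network dominates the
combined output. The ValueError branch corresponds to the near-singular case
discussed above, where highly correlated networks should be excluded.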

In order to carry out a prediction, a d:n:n:1 four-layer feed-forward Quickprop
neural network (d input units, n hidden units per hidden layer, and a single
output unit) has been considered. For our study, the Quickprop networks of the
commercially available artificial neural network simulator JavaNNS 1.1 [10]
were used.

4 Architecture 

The general architectural framework for a Multi-Agent based database statistical
anomaly prediction system is illustrated in Fig. 1. It has been implemented
using the Aglets Software Development Kit (ASDK) [8] and the Java Aglet API
(J-AAPI) developed by IBM Tokyo Research Laboratory. In this architecture, two
kinds of agents are considered: 1. Information Agent 2. Host Agent.

4.1 Information Agent 

Fig. 1. Intelligent Multi-Agent based Database Statistical Anomaly Prediction System

Information Agent (Static Agent) acts as a data processing unit and as a data
repository for the Host Agents. It is responsible for collecting and storing, in
a timely fashion, the user profiles gathered by the various agents for all users
that have access to the data in the protected network. It also provides the user
profile to the Host Agent whenever it is requested. The Information Agent
comprises three main components: 1. Host Monitor 2. XML Audit Profile Server
3. Admin Interface.

• Host Monitor (Mobile Agent): In a distributed environment, the performance
of each host has to be monitored constantly so that a performance drop or
failure of any node can be detected. Based on that, corrective measures can
be taken to maintain the overall performance level of the network.
When an Information Agent is created, it sends a monitor agent to every
host in the network. As soon as it reaches the host, the monitor agent starts
monitoring the performance at regular intervals, and this interval can be
programmed.

• XML Audit Profile Server: Audit records are written in XML format and
stored in the XML Audit Profile Server. The generation and insertion of an
audit trail record is independent of a user’s transaction. Therefore, even if
a user’s transaction is rolled back, the audit trail record remains committed.
The Profile Server must be able to provide information about past, present
and future user behavior, and it must allow forecasting based on temporal
logic. So, in this work we have chosen an audit record database that
maintains past, present and future data about users and is termed a temporal
database. Authorization rules are installed in the XML repository, and these
rules monitor the XML databases for the occurrence of events through event
listeners constructed on each node of the XML document.

• Admin Interface: The Interface agent (Static Agent) provides a friendly
human-computer interface for the system administrator. It presents information
to the administrator through a GUI and receives control commands from the
GUI.

4.2 Brain

Brain(Static Agent) is the central component of the agent and it initiates, con- 
trols, coordinates and integrates the activities of all the components of both 
Information Agent and Host Agent. 



4.3 Host Agent 

A Host Agent resides on every host in the protected distributed database
environment. It can be split into basic intelligent agents such as the Auditing
Monitor, the Knowledge Manager and the Actioner.

• Auditing Monitor: This static agent monitors every user who logs into
the system. The database objects and privileges of the current users on the
host machine are logged and sent to the Information Agent.

• Temporal Forecasting Engine: It is responsible for processing the monitored
data from the Host Agent and generating forecasting data for the next
session of the specific user. We moved the forecasting module from the
Information Agent to the Host Agent and repeated the experiments to discover
whether the results differ and to maximize the performance of the intrusion
prevention system. In this way the agents distribute not only the security
functions but also the processing workload of the Information Agent.

• Intelligent JESS Inference Engine: When a transaction such as the insertion,
deletion or update of an element happens on the XML file, the ETCA rule
given in XML format is mapped to a JESS authorization rule, which is
predefined by the database administrator. The events that occur are also
converted to JESS facts. The JESS Inference engine constantly monitors the
JESS rules and, on the occurrence of a new fact, executes the matching JESS
rule, producing new JESS facts as the result. These JESS facts must later
be converted into a suitable action, which is transmitted to the client. The
engine checks pre-defined rules in order to detect database anomalies caused
by successful attacks. Users are then granted access privileges only to the
system containing data for which they have been authorized via a JESS Rule
Execution Engine. If the JESS Inference engine fires a rule, the effect is
reflected in the database dynamically, while the client is still in the
transaction.

• Knowledge Manager: This mobile agent gets the appropriate profile for
the specific user, which is stored locally on the XML Audit Profile Server, and
then compares the user’s historical profile with the information sent by the
Host Monitor. The Knowledge Manager performs this comparison constantly. If
the current behavior profile does not match the normal behavior pattern
defined by the user’s historical profile, the Knowledge Manager provides
the following information to the Actioner: user identifier, session identifier,
host identifier & IP address, and the invalid privilege together with the
unauthorized object that the user attempted to access.

• Actioner: The role of the Actioner (Static Agent) is to take the necessary
actions when an intrusion is detected. It also uses the prediction data for a
user to take preemptive actions on the user behavior. When an attack is
detected, the Actioner performs one of the following operations to terminate
the attack: 1. Reject the user’s attempt with a warning message 2. Terminate
the specific operation on the particular database object 3. Lock the user’s
keyboard and prevent the user from consuming any further data resources
4. Report the intrusion detected on a host to the system administrator via the
Information Agent. In the Actioner, the action element is put in the SOAP
component. Then the SOAP header and SOAP envelope are constructed over it,
including the endpoint to which the data has to be sent. The output of this
component is a SOAP message, which is then appended to the action part of the
rule. JAXM (Java API for XML Messaging) is the package used to send the
SOAP message across different clients. The result should be presented in a
user-readable form. For example, XML documents can be presented as HTML
pages with XSLT style sheets.

• Profile Reader: This mobile agent is responsible for fetching the on-line 
data from the auditing monitor and the predicted values from the temporal 
forecasting engine. Then it sends this information to the Information agent. 

• Rule Generator: This static agent is assigned the task of rule creation 
based on the request from the client and it is responsible for the submission 
of the rules to the XML Server. 

• Client Interface: This is an application-dependent program that communicates
with the client to get the client’s request and to return the response to the
client. In the case of stationary systems, the system needs to communicate
via a network using standard HTTP. In order to support mobile users, this
component converts the XML data into WML (Wireless Markup Language) [11]
data, and WAP (Wireless Application Protocol) [11] is used to transfer this
WML data to the mobile devices.
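
The comparison performed by the Knowledge Manager, and the information it hands
to the Actioner, can be sketched as follows. This is a simplified illustration
under stated assumptions: usage is represented as the fraction of an hour an
object was used, the tolerance of 0.2 is a hypothetical threshold, and the
names Alert, compare_with_profile and raise_alerts do not come from the system
itself.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Information passed from the Knowledge Manager to the Actioner."""
    user_id: str
    session_id: str
    host_id: str
    ip_address: str
    privilege: str
    object_id: str


def compare_with_profile(observed, predicted, tolerance=0.2):
    """Return the objects whose observed usage deviates from the predicted profile.

    observed and predicted map object ids to the fraction of an hour the object
    was used; tolerance is a hypothetical threshold chosen for this sketch.
    """
    deviating = []
    for object_id, usage in observed.items():
        expected = predicted.get(object_id, 0.0)
        if abs(usage - expected) > tolerance:
            deviating.append(object_id)
    return deviating


def raise_alerts(session, observed, predicted, tolerance=0.2):
    """Build one Alert per deviating object for the Actioner."""
    return [
        Alert(session["user_id"], session["session_id"], session["host_id"],
              session["ip_address"], session["privilege"], object_id)
        for object_id in compare_with_profile(observed, predicted, tolerance)
    ]
```

An object that does not appear in the predicted profile is treated as having an
expected usage of zero, so any substantial use of it is reported.
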

5 Experimental Results

5.1 Model Development and Analysis of Data 

This study obtains a collection of audit data for normal transactions from a
major corporate bank in Chennai, India. The database objects considered are (1)
Customer Deposit Accounts (CDAcc), (2) Customer Loan Accounts (CLAcc)
and (3) Ledger Reports related to each transaction on the Customer
Accounts (LRep). The database objects are used by Tellers (Tlr), Customer
Service Reps (CSR) and Loan Officers (LO) to perform various transactions. They
are also used by Accountants (Acc), Accounting Managers (AccMr) and Internal
Auditors (IntAud) to post, generate and verify accounting data. The Branch
Manager (BrM) has the ability to perform any of the functions of the other roles
in times of emergency and to view all transactions, account statuses and
validation flags. Normal transactions are generated by simulating activities
observed in a corporate bank information system under usual operating
conditions. To create the audit data of intrusive activities, a number of
intrusions are also simulated in our laboratory, including password guessing to
gain the root privilege, attempts to gain unauthorized remote access, and
flooding an information system with an overwhelming number of service requests
over a short period of time to deplete the computational resources of the server
and thus deny the server’s ability to respond to users’ service requests [3].



5.2 Training and Testing Data 

We obtain 8 weeks of audit data, covering December 2003 and January 2004, from
the corporate bank in our study [4]. We use the first part of the audit data
for normal activities as our training dataset, and use the remaining audit data
for normal activities and attack activities as our testing dataset. The first half
of the audit data, consisting of 16,413 audit transactions lasting four weeks, is
used as the training data. In the testing dataset, the average session length is
comparatively smaller in week-2 and week-3 than in week-1 and week-4.
In terms of sessions, almost one-fifth of the sessions in week-2 and one-fifteenth
of the sessions in week-3 are intrusion sessions. Week-1 contains mostly normal
sessions, and week-4 also has few intrusion sessions. Week-1 and
week-2 contain 12 and 16 normal sessions, and week-2 and week-3 contain 6 and
7 intrusion sessions. Hence, the testing data contains a total of 28,574 audit
transactions in 3 segments, in the sequence: 1. 7,320 normal events (the
first half of the 14,640 normal events) 2. 13,934 intrusive events 3. 7,320 normal
events (the second half of the 14,640 normal events).

5.3 Selection of NN Architecture 

The topology of a network architecture is crucial to its performance, but
there is no easy way to determine the optimum number of hidden layers and
neurons without training several networks. For example, if there are too many
neurons, the result may be a condition called "over-fitting", which means that
the network performs well on the training set but does not generalize properly;
in this case the network begins to learn the noise as well. If there are too few
neurons, the result may be a condition termed "under-fitting", which leads to
high training and generalization error [3]. In order to find the most
appropriate parameters for prediction, we carried out a sweep over the number of
neurons of the hidden layer as an initial test. A learning rate of 0.1 and a
momentum factor of around 0.5 produced the fastest learning for this problem. In
this test, the weights of each network were initialized with random values
within the range [-0.5, 0.5], and 15,000 iterations were carried out. Bounding
the weights can help prevent the network from becoming saturated. The weights
and unit outputs were being turned off hard during the early stages of learning,
and they were getting stuck in the zero state. We therefore altered the
sigmoid-prime function so that it does not go to zero for any output value: we
simply added a constant 0.1 to the sigmoid-prime value before using it to scale
the error. This modification made a dramatic difference, cutting the learning
time almost in half. In this way, various network architectures with different
numbers of hidden layers and neurons in each layer were investigated. The
network architectures used for this study are described in Table 1.
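
The two training details mentioned above, the bounded random initialization and
the offset added to the sigmoid-prime value, can be sketched as follows. The
function names and the optional seed are assumptions made for illustration; the
range [-0.5, 0.5] and the constant 0.1 are the values stated in the text.

```python
import math
import random


def sigmoid(x):
    """Standard logistic activation."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime_from_output(output, offset=0.1):
    """Derivative of the sigmoid expressed via its output, plus a constant offset.

    The plain derivative output * (1 - output) vanishes as the output approaches
    0 or 1; adding the offset keeps the error signal from being scaled to zero.
    """
    return output * (1.0 - output) + offset


def init_weights(n_inputs, n_units, low=-0.5, high=0.5, seed=None):
    """Draw initial weights uniformly from [low, high] for one layer."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(n_inputs)] for _ in range(n_units)]
```

With the offset, a unit whose output has saturated at 0 or 1 still passes back
a tenth of the error, so it can recover instead of staying stuck.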



Table 1. Ensemble members with different network architectures. HL: Number of
Hidden Layers; HL 1: Number of Nodes, Activation function - Hidden Layer (1);
HL 2: Number of Nodes, Activation function - Hidden Layer (2); TT: Training time

Model  HL  HL 1         HL 2         TT
I      1   10           -            748
II     1   20           -            1561
III    1   10,tanh      -            821
IV     2   12,tanh      10,tanh      1761
V      2   10,gaussian  12,gaussian  1899



It can be observed that successively increasing the number of neurons in the
hidden layer hardly diminishes the training error, while the validation
error increases considerably. The training time is approximately linear in
the number of neurons of the hidden layer.

5.4 Experimental Topology 

Fig. 2 shows a piece of the actual and predicted behavior curve for user
Customer_Service_Reps (CSR), using Quickprop as the forecasting model. It shows
the actual values of the real observations and the associated single-step
forecasts. The X-axis specifies the real observations with our forecasting
results and the Y-axis defines the usage of an object for an hour. For example,
0.2 means that the user has used that particular object for 12 minutes in a
specific hour of the object usage. We present a comparative graph (Fig. 2) of
the data resource consumption by CSR for 100 validation patterns, which gives
the real values versus the predicted values.

Fig. 2. Real values versus predicted values of the user

5.5 Validating the Training 

In this test we tried to analyze the effect of the number of iterations of the 
learning algorithm. A Quickprop neural network ensemble was used for this 
test. 

Fig. 3 shows the training and test errors against the learning epoch, that is,
the change of the MAPE over time for the training of the network using 15,000
epochs. What is noticeable is the sharp decrease in both errors during the first
1,200 epochs. Thereafter the errors decrease at a slower rate. We observed that
the number of iterations has less influence on the obtained error than the
number of neurons in the hidden layer: although the training error can be
appreciably diminished by increasing the number of iterations, the same does not
happen in the validation phase, in which the error remains more stable
throughout the experiment.

5.6 Rule Processing Subsystem 

Rule processing subsystem diagrams for the two important processes, rule
generation and transmission, and rule triggering and action delivery, are shown
in Fig. 4 and Fig. 5 respectively.

Fig. 3. Comparison of the errors associated with the Quickprop Neural Network
Ensemble

Fig. 4. Rule Generation



5.7 Performance Analysis Results and Discussion 

In order to measure the error made by the neural network model, a widely
accepted quantitative measure, the Mean Absolute Percentage Error (MAPE), has
been used. The performance of the Quickprop forecaster was measured in terms of

\[ MAPE = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{Target_i - Output_i}{Target_i} \right| \times 100 \]

where N is the number of observations, $Target_i$ is the target output and
$Output_i$ is the predicted output. MAPE is employed in this paper as the
performance criterion because it is easy to understand and simple to compute.
The mean absolute percentage error determines the mean percentage deviation of
the predicted outputs from the target outputs. This absolute deviation places
greater emphasis on errors occurring with small target values than on those
with larger target values. The MAPE for all the above Quickprop network
architectures is summarized in Table 2.

Fig. 5. Rule Triggering and Action Delivery
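
The MAPE criterion described above can be sketched in a few lines of Python.
The function name and the explicit checks for empty input and zero targets are
assumptions added for illustration; the result is expressed in percent.

```python
def mape(targets, outputs):
    """Mean absolute percentage error, in percent."""
    if len(targets) != len(outputs) or not targets:
        raise ValueError("targets and outputs must be non-empty and of equal length")
    total = 0.0
    for target, output in zip(targets, outputs):
        if target == 0:
            raise ValueError("MAPE is undefined for a zero target value")
        total += abs((target - output) / target)
    return 100.0 * total / len(targets)
```

The same absolute error of 0.05 contributes 25% at a target of 0.2 but only 5%
at a target of 1.0, which illustrates why the measure emphasizes errors on
small target values.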



Table 2. Error Measurement

Models     Tlr     CSR     LO      Acc     AccMr    IntAud   Avg. of MAPE(%)
Model I    5.1448  7.6308  0.1392  7.8316  18.8750  48.4673  14.6814
Model II   1.9747  3.9965  0.0713  4.1780  9.1140   26.5326  7.64452
Model III  3.5841  5.0164  0.0912  5.5522  14.4856  37.2338  10.9939
Model IV   2.0667  4.9716  0.0873  5.2297  10.5408  30.3359  8.8720
Model V    4.2786  5.7335  0.1208  6.0443  15.5016  45.7680  12.9078



The generalized ensemble method (GEM) assigned weights to the individual
models depending upon their relative performance, which resulted in the
reduction of the MAPE as shown in Table 2. If the MAPE between two models had
been low, then one could use an ensemble of the two models to obtain an output
that adequately represents the entire variability of the true data. In this
study, the least MAPE was obtained for models II and IV. So when these two
models were considered as the ensemble members and assigned the appropriate
weights according to Equation 2, the MAPE values decreased (Table 3). A slight
degradation in the validation error for the proposed network can be observed.

Table 3. Error Measurement

Model       Tlr     CSR      LO       Acc      AccMr    IntAud   Avg. of MAPE(%)
GEM(II,IV)  1.2812  2.59296  0.04626  2.71072  5.91323  17.2145  4.95982

Examining the performance of the Quickprop ensemble technique with different
combinations of data sets leads to the following findings:

• Users Tlr & CSR present more uniform behavior. Because of this, all models
achieve better performance with users Tlr & CSR, as shown in Table 2. This
is especially noticeable for the prediction models, because user CSR is much
less affected by hours off.

• With respect to the other users, it can be seen that the prediction models
achieve better accuracy for user LO. Although this user has a large number of
days off and hours off, the false alarm rate achieved by all prediction models
is minimal due to the strong weekly periodicity of that user.

• User Acc is an intermediate case. This user has high daily periodicity but
a large number of days off. This fact strongly penalizes the prediction
models, which cannot achieve satisfactory accuracy.

• Finally, it is worth adding that the error achieved by all prediction models
is larger for AccMr, IntAud & BrM, because of the mixed periodicities present
in those users and, most importantly, because they have a large number of days
off, which makes it difficult for the models to find the relations between
input and output variables. In fact, the least accurate predictions are
obtained for these users, as shown in Table 2.

6 Conclusions and Future Works 

In this paper, an Intelligent Multi-Agent based distributed database intrusion
prevention system has been presented that learns previously observed user
behavior in order to prevent future intrusions in database systems. Quickprop NN
was investigated as a tool to predict database intrusions, and several NN
architectures were explored by varying parameters such as the number of neurons,
the number of hidden layers and the activation functions. The ensemble method
gave improved results, i.e., a decreased MAPE compared to any individual NN
model. TARML has been designed for the misuse prevention system, and it can suit
any kind of real domain and any platform. As a future extension, fuzzy rules can
be combined with JESS so that the intrusion prevention system can be made much
more effective. Thus, the system has been developed to demonstrate the use of
intelligent agents for auditing the transactions within the organization,
detecting potential risks, and avoiding uncontrollable transactions. In the
future we will incorporate other NN models and ensemble techniques to further
improve the predictability of database intrusions and reduce the rate of false
negative alarms.






P. Ramasubramanian and A. Kannan 



References 

1. Chung, C.Y., Gertz, M., Levitt, K.: Misuse detection in database systems through
user profiling. In Web Proceedings of the 2nd International Workshop on the Recent
Advances in Intrusion Detection (RAID). (1999) 278.

2. Pikoulas, J., Buchanan, W.J., Manion, M., Triantafyllopoulos, K.: An intelligent
agent intrusion system. In Proceedings of the 9th IEEE International Conference
and Workshop on the Engineering of Computer Based Systems - ECBS, IEEE
Comput. Soc., Lund, Sweden. (2002) 94-102.

3. Ramasubramanian, P., Kannan, A.: Quickprop Neural Network Short-Term
Forecasting Framework for a Database Intrusion Prediction System. In Proceedings
of the Seventh International Conference on Artificial Intelligence and Soft
Computing (ICAISC 2004) June 7-11, 2004 at Zakopane, Poland, Lecture Notes in
Computer Science, Vol. 3070, Springer-Verlag ISBN: 3-540-22123-9. (2004) 847-852.

4. Ramasubramanian, P., Kannan, A.: Multivariate Statistical Short-Term Hybrid
Prediction Modeling for Database Anomaly Intrusion Prediction System. In
Proceedings of the Second International Conference on Applied Cryptography and
Network Security (ACNS 2004) June 8-11, 2004 at Yellow Mountain, China, Lecture
Notes in Computer Science, Vol. 3089, Springer-Verlag ISBN: 3-540-22217-0.
(2004)

5. Ramasubramanian, P., Kannan, A.: An Active Rule Based Approach to Database 
Security in E-Commerce Systems using Temporal Constraints. In Proceedings of 
IEEE Tencon 2003, October 14-17, 2003 at Bangalore, India. 1148-1152. 

6. Sin Yeung Lee, Wai Lup Low and Pei Yuen Wong.: Learning Fingerprints For A 
Database Intrusion Detection System. In Proceedings of the 7th European Sym- 
posium on Research in Computer Security, Zurich, Switzerland. (2002) 264-280. 

7. Simple Object Access Protocol (SOAP) 1.1. Available at URL 
http://www.w3.org/TR/2000/SOAP (2004) 

8. Java Aglet, IBM Tokyo Research Laboratory. Available at URL 
http://www.trl.ibm.co.jp/aglets (2004) 

9. Triantafyllopoulos, K., Pikoulas, J.: Multivariate Bayesian regression applied to 
the problem of network security. Journal of Forecasting. 21 (2002) 579-594. 

10. Java Neural Network Simulator 1.1. Available at URL 
http://www-ra.informatik.uni-tuebingen.de/downloads/JavaNNS (2004) 

11. WAP specifications and WAP gateway. Available at URL 
http://www.wapforum.org, http://www.wapgateway.org (2004) 

12. XML Path Language (XPath) 1.0. Available at URL 
http://www.w3.org/TR/XPATH (2004) 

13. XQuery 1.0: An XML Query Language. Available at URL 
http://www.w3.org/TR/XQuery (2004) 



Energy Efficient Transaction Processing in 
Mobile Broadcast Environments* 



SangKeun Lee

Department of Computer Science and Engineering,
Korea University, Seoul, South Korea
yalphy@korea.ac.kr



Abstract. Broadcasting in wireless mobile computing environments is
an effective technique to disseminate information to a massive number of
clients equipped with powerful, battery-operated devices. To conserve
energy, which is a scarce resource, the information to be broadcast
must be organized so that the client can selectively tune in at the desired
portion of the broadcast. In this paper, the energy efficient behavior of
predeclaration-based transaction processing in mobile broadcast
environments is examined. Analytical studies have been performed to
evaluate the effectiveness of our method. The analysis shows that
predeclaration-based transaction processing with selective tuning ability
can significantly improve battery life while retaining a low access time
in mobile broadcast environments.



1 Introduction 

With the advent of third generation wireless infrastructure and the rapid growth
of wireless communication technologies such as Bluetooth and IEEE 802.11, mobile
computing has become possible. People with battery-powered mobile devices
can access various kinds of services at any time and in any place. However, existing
wireless services are limited by the constraints of mobile environments, such as
narrow bandwidth, frequent disconnections, and the limitations of battery
technology. Thus, mechanisms to efficiently transmit information from the server to
a massive number of clients have received considerable attention [1], [2], [7].

Wireless broadcasting is an attractive approach for data dissemination in
a mobile environment. Disseminating data through a broadcast channel allows
simultaneous access by an arbitrary number of mobile users and thus allows
efficient usage of scarce bandwidth. Due to this scalability feature, the wireless
broadcast channel has been considered an alternative storage medium to
traditional hard disks [1], [7]. Applications such as using palmtops to access
airline schedules, stock activities, traffic conditions, and weather information on
the road are expected to become increasingly popular. It is noted, however, that
several mobile computers, such as laptops and palmtops, use batteries of limited
lifetime for their operations and are not directly connected to any power source.
As a result, energy efficiency is a very important issue to resolve before we can
anticipate an even wider acceptance of mobile computers [6], [11], [13].

Among others, one viable approach to energy efficiency is to use an indexed
data organization to broadcast data over wireless channels to mobile clients.
Without any auxiliary information on the broadcast channel, a client may have
to access all objects in a broadcast cycle in order to retrieve the desired data. This
requires the client to listen to the broadcast channel all the time, which is power
inefficient. Air indexing techniques address this issue by pre-computing some
index information and interleaving it with the data on the broadcast channel,
and many studies appear in the literature [4], [5], [7], [14]. By first accessing the
broadcast index, the mobile client is able to predict the arrival time of the desired
data. Thus, it can stay in the power saving mode most of the time and tune
into the broadcast channel only when the requested data arrives. The drawback
of this solution is that broadcast cycles are lengthened by the additional index
information. As such, there is a trade-off between access time and tuning time.
In mobile broadcast environments, the following two parameters are of concern:

— Access Time: the period of time elapsed from the moment a mobile client 
issues a query to the moment when the requested data items are received by 
the client. 

— Tuning Time: the period of time spent by a mobile client staying active in 
order to retrieve the requested data items. 

While access time measures the overhead of an index structure and the ef- 
ficiency of data and index organization on the broadcast channel, tuning time 
is frequently used to estimate the power consumption by a mobile client since 
sending/receiving data is power dominant in a mobile environment [8]. 

In this paper, we consider the issue of energy efficient transaction processing,
where multiple data items are involved, in mobile broadcast environments. To
the best of our knowledge, power conservation in the context of transaction
processing has not been addressed before. In our previous work [9], a
predeclaration-based query optimization was explored for efficient (in terms of
access time) processing of wireless read-only transactions in mobile broadcast
environments. There, clients simply tune into the broadcast channel and wait for
the data of interest. This paper extends our previous work to analyze the energy
efficient behavior of predeclaration-based transaction processing in various
types of indexed data organizations.

The remainder of this paper is organized as follows. Section 2 describes 
the background of our system model and indexed data organizations. Section 
3 presents the proposed access method in the context of predeclaration-based 
transaction processing. Section 4 develops analytical models to examine the ef- 
fectiveness of the proposed scheme, and Section 5 reports the access and tuning 
time of our transaction processing scheme in various indexed data organizations. 
The conclusion of the paper is in Section 6.

2 Preliminaries

2.1 Basics of Wireless Broadcasting 

We here briefly describe the model of a mobile broadcast system, which is similar
to the models in [1], [7]. The system consists of a data server and a number of
mobile clients connected to the server through a low bandwidth wireless network.
The server maintains the consistency of a database and applies the updates of
update transactions, which are issued only on the server side. The correctness
criterion for transaction processing adopted in this paper is serializability [3],
which was shown in [9] to be inexpensive to achieve. The server periodically
broadcasts the data items in the database to a number of clients, on a
communication channel which is assumed to have broadcasting capability. Clients
only receive the broadcast data and fetch individual items (identified by a key)
from the broadcast channel. Updates to the items are reflected only between
successive broadcasts. Hence, the content of the current version of the broadcast
is completely determined before the start of the broadcast of that version.

In our model, filtering is by simple pattern matching of the primary key.
Clients remain in doze mode most of the time and tune in periodically to
the broadcast channel in order to download the required data. Selective tuning
requires that the server, in addition to broadcasting the data, also broadcast
index information that indicates the point of time in the broadcast channel when
a particular data item is broadcast. The broadcast channel is the source of all
information to the client, including data as well as index.

Each version of the database items along with the associated index information
constitutes a bcast, which is organized as a sequence of buckets. A
bucket is the smallest logical unit of a broadcast, and its size is a multiple of
the size of a packet. All buckets are of the same size. Both access time and tuning
time are measured in terms of the number of data items, with the assumption that,
without loss of generality, the size of a data item is identical to the size of a bucket.
Pointers to specific buckets within the bcast are provided by specifying an offset
from the bucket which holds the pointer to the bucket to which the pointer points.
The actual time of broadcast for such a bucket (from the current bucket) is
the product of (offset - 1) and the time necessary to broadcast a bucket.
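As a small illustration of the pointer arithmetic above, the following sketch converts an offset into a waiting time. The function name and the bucket time of 0.1 seconds used in the example are illustrative assumptions and are not part of the system model.

```python
def time_until_broadcast(offset: int, bucket_time: float) -> float:
    """Time from the current bucket until the pointed-to bucket is broadcast.

    Implements the rule stated in the text: (offset - 1) times the time
    needed to broadcast one bucket. The name and signature are hypothetical.
    """
    if offset < 1:
        raise ValueError("offset must be at least 1")
    return (offset - 1) * bucket_time


# Example with an assumed bucket time of 0.1 s: for a pointer with offset 5
# the client can stay in doze mode for about 0.4 s before tuning in again.
doze_seconds = time_until_broadcast(5, 0.1)
```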

It is assumed that each data item in the database appears once during one 
broadcast cycle, i.e. uniform broadcast [1]. We assume that the content of the 
broadcast at each cycle is guaranteed to be consistent. That is, the values of 
data items that are broadcast during each cycle correspond to the state of the 
database at the beginning of the cycle, i.e. the values produced by all transactions 
that have been committed by the beginning of the cycle. 

2.2 Data Organization on the Broadcast Channel 

In general, data organization techniques which seek an optimum in the
two-dimensional space of access and tuning time are of importance. Being
interleaved with data, the index provides a sequence of pointers which eventually
lead to the required data. To interleave data and index on the wireless broadcast
channel, the Access_opt, Tune_opt, and (1,m) indexing techniques [7] are considered
in this paper; they are illustrated in Figure 1.

Fig. 1. Data and Index Organization

— Access_opt: This technique provides the best access time with a very large
tuning time with respect to a single item. The best access time is obtained
when no index is broadcast along with the data items. The size of the entire
broadcast is minimal in this way. Clients simply tune into the broadcast
channel and filter all the data until the required data item is downloaded.

— Tune_opt: This technique provides the best tuning time with a large access
time with respect to a single item. The server broadcasts the index at the
beginning of each bcast. A client which needs the item with primary key
K tunes into the broadcast channel at the beginning of the next bcast to get
the index. It then follows the index pointers to the item with the required
primary key. This method has the worst access time because clients have
to wait until the beginning of the next broadcast even if the required data is
just in front of them.

— (1,m) indexing: In this method, the index is broadcast m times during a
single broadcast cycle. The whole index is broadcast preceding every 1/m
fraction of the broadcast cycle. In order to reduce the tuning time, each index
segment (i.e. the set of contiguous index buckets) and each data segment (i.e.
the set of data buckets broadcast between successive index segments) contains
a pointer to the root of the next index.

In the case of Tune_opt and (1,m) indexing, selective tuning is accomplished by
multiplexing an index with the data items in the broadcast. The clients are
only required to operate in active mode when probing for the address of the
index, traversing the index, and downloading the required data, while spending
the waiting time in doze mode. Each entry of the index contains the pair (id,
offset). Each data bucket also has an offset that points to the next index.
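The three organizations can be sketched as different ways of laying out one broadcast cycle. The code below is a minimal illustration under simplifying assumptions: buckets are plain labels, the index is treated as an opaque list of index buckets, pointers and offsets are not modelled, and the function name and organization strings are hypothetical.

```python
def build_bcast(data, index, organization, m=1):
    """Return one broadcast cycle as a list of bucket labels.

    data and index are lists of labels; organization is one of the
    hypothetical strings "access_opt", "tune_opt" or "1m".
    """
    if organization == "access_opt":
        # No index at all: the cycle has the minimal length D.
        return list(data)
    if organization == "tune_opt":
        # The whole index once, at the beginning of the bcast: length I + D.
        return list(index) + list(data)
    if organization == "1m":
        # The whole index before every 1/m fraction of the data: length mI + D.
        if m < 1 or not data or len(data) % m != 0:
            raise ValueError("this sketch needs m >= 1 and len(data) divisible by m")
        segment = len(data) // m
        bcast = []
        for start in range(0, len(data), segment):
            bcast.extend(index)
            bcast.extend(data[start:start + segment])
        return bcast
    raise ValueError(f"unknown organization: {organization}")


data = [f"d{i}" for i in range(6)]
index = ["i0", "i1"]
layout = build_bcast(data, index, "1m", m=2)
```

With these toy inputs, the (1,m) layout for m = 2 repeats both index buckets before each half of the data, so the cycle grows from 6 to 10 buckets. This is the lengthening of the broadcast cycle that trades access time for tuning time.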

3 Energy Efficient Predeclaration-Based Transaction 
Processing 

In previous work [9], [10] we proposed three predeclaration-based transaction
processing schemes for mobile broadcast environments, namely P (Predeclaration),
PA (Predeclaration with Autoprefetching), and PA² (PA / Asynchronous).
The analysis-based and simulation-based studies showed that they
are able to greatly improve the access time of read-only transaction processing.
The central idea was to deploy the predeclaration technique in order to minimize
the number of different broadcast cycles from which transactions retrieve data.

In this work, method P is adopted as our basic transaction processing
approach and is extended to integrate selective tuning ability, since methods
PA and PA² rely on client caching, which is orthogonal to the
issue addressed in this paper. Before proceeding, we briefly explain the usefulness
of predeclaration-based transaction processing to clarify its basic behavior.

Predeclaration and its Usefulness: The uniform broadcast in the Access_opt
organization is illustrated in Figure 2, where the server broadcasts a set of data
items d0 to d6 in one broadcast channel. Suppose that a client transaction
program starts its execution: IF (d2 < 3) THEN read(d0) ELSE read(d1). To
show that the order in which a transaction reads data affects the access time of
the transaction, consider the traditional client transaction processing in Figure
2-(a). Since both d0 and d1 precede d2 in the bcast with respect to the client
and access to data is strictly sequential, the transaction has to read d2 first
and then wait to read the value of d0 or d1. Thus, the access time of the transaction
is 11 if d2 and d0 are accessed, or 12 if d2 and d1 are accessed. If,
however, all data items that will potentially be accessed by the transaction, i.e.
{d0, d1, d2}, are predeclared in advance, a client can hold all necessary data
items with a reduced response time of 6, as illustrated in Figure 2-(b).
Thus the use of predeclaration allows the necessary items to be retrieved in the
order they are broadcast, rather than in the order the requests are issued.
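The access times quoted in this example can be reproduced with a short calculation. The sketch below assumes that the client starts listening just before d4 is broadcast; this starting position is an assumption chosen because it yields the values 11, 12, and 6 given in the text, since Figure 2 itself is not reproduced here. Items are represented by their index in the cycle (d2 is 2), and the function names are hypothetical.

```python
def sequential_access_time(read_order, next_slot, cycle_len):
    """Buckets elapsed when items are read strictly in program order.

    next_slot is the index of the item that will be broadcast next.
    """
    elapsed = 0
    for item in read_order:
        # Wait until the item comes around, then one bucket to download it.
        wait = (item - next_slot) % cycle_len + 1
        elapsed += wait
        next_slot = (item + 1) % cycle_len
    return elapsed


def predeclared_access_time(items, next_slot, cycle_len):
    """Buckets elapsed when all predeclared items are taken in broadcast order."""
    return max((item - next_slot) % cycle_len + 1 for item in items)


CYCLE = 7   # d0 .. d6
START = 4   # assumed: d4 is the next item on the channel

then_branch = sequential_access_time([2, 0], START, CYCLE)    # read d2, then d0
else_branch = sequential_access_time([2, 1], START, CYCLE)    # read d2, then d1
predeclared = predeclared_access_time({0, 1, 2}, START, CYCLE)
```

Under this assumption, the traditional execution needs a second pass over the channel because d0 and d1 have already gone by when d2 is read, whereas the predeclared readset is complete as soon as d2 arrives.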

Fig. 2. Usefulness of Predeclaration-based Transaction Processing

Method P has been presented with the Access_opt data organization in [9].
That is, without any form of index, the client has to tune to the channel all the
time during the whole process of filtering. For realistic applications, however,
this may be unacceptable as it requires the client to be active for a long time,
thereby consuming scarce battery resources. In this paper, we instead
provide a selective tuning ability for method P, enabling the client to become
active only when data of interest is being broadcast, in the context of Tune_opt
and (1,m) indexing. In general, the access protocol for Tune_opt and (1,m)
indexing involves the following steps:

— Initial probe: The client tunes into the broadcast channel and determines
when the next nearest index will be broadcast. This is done by reading the
offset to determine the address of the next nearest index segment. It then
switches to the power saving mode until the next index arrives.

— Index search: The client searches the index. It follows a sequence of pointers 
(i.e. selectively tunes into the broadcast index) to locate the data of interest 
and find out when to tune into the broadcast channel to get the desired data. 
It waits for the arrival of the data in the power saving mode. 

— Data retrieval: The client tunes into the channel when the desired data ar- 
rives and downloads the data. 



3.1 Access Protocol for Multiple Data Items 

We first need to elaborate the access protocol for searching and retrieving
multiple data items effectively. In predeclaration-based transaction processing, all
data items are predeclared prior to the actual processing and should be retrieved
in the order they appear on the broadcast channel. The client is therefore required
to predict the arrival time of items by exploiting the index information available
on the air. This can be done as follows: since the index provides a sequence of
pointers which eventually lead to a single required item, the
client is able to sort all the pointers, which constitute the index information for
the multiple data items of interest, in the order they appear on the channel. This
results in a long sequence of multiplexed pointers. This idea is illustrated
at the bottom of Figure 3, where a client transaction requires two data items,
d1 and d5, in that order, with the current position at d4. It is observed that some
portion of the index information, i.e. i0 and i2 in this specific example, needs to be
visited only once to retrieve the two data items. This illustrates the reduction
of tuning time which may be possible in our predeclaration-based transaction
processing, compared to a straightforward approach where an individual, separate
initial probe and index search process is performed for each data item.

Fig. 3. Access Protocol for Multiple Data Items in Tune_opt
(Sequence of pointers for d5: i0 -> i2 -> i?. Tuning sequence for d1 and d5:
i0 -> i2 -> i5 -> i7 -> d1 -> d5.)
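The construction of a sequence of multiplexed pointers can be sketched as a merge of the per-item pointer paths, ordered by position on the channel, with shared index buckets kept only once. In the example below, the individual paths for d1 and d5 and all slot numbers are assumptions made for illustration; only the resulting tuning sequence is taken from Figure 3. The function name is hypothetical.

```python
def tuning_sequence(paths, slot):
    """Merge per-item pointer paths into one sequence of buckets to visit.

    paths maps each required item to the buckets on its index path
    (ending with the data bucket); slot maps a bucket label to its
    position in the broadcast cycle.
    """
    buckets = set()
    for path in paths.values():
        buckets.update(path)          # shared index buckets are kept once
    return sorted(buckets, key=lambda b: slot[b])


# Assumed channel positions and index paths (hypothetical values).
slot = {"i0": 0, "i2": 2, "i5": 5, "i7": 7, "d1": 9, "d5": 13}
paths = {
    "d1": ["i0", "i2", "i5", "d1"],
    "d5": ["i0", "i2", "i7", "d5"],
}

merged = tuning_sequence(paths, slot)
one_at_a_time_visits = sum(len(p) for p in paths.values())
```

In this toy setting, the one-at-a-time approach tunes in for 8 buckets, while the merged sequence needs only 6, because i0 and i2 are visited once for both items.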

3.2 Method P with Selective Tuning 

Now, we describe the behavior of method P to achieve an improvement of tuning
time, while retaining a low access time, with Tune_opt or (1,m) indexing
techniques in mind. Let us define the predeclared readset of a transaction T,
denoted by Pre_RS(T), to be the set of data items that T potentially reads. Each
client processes T in three phases: (1) Preparation phase: it gets Pre_RS(T) and
constructs a sequence of multiplexed pointers to all the items in Pre_RS(T);
(2) Acquisition phase: it acquires the data items in Pre_RS(T) from the periodic
broadcast. During this phase, a client additionally maintains a set Acquire(T)
of all data items that it has acquired so far; and (3) Delivery phase: it delivers
data items to its transaction according to the order in which the transaction
requires data.

In particular, to construct a sequence of multiplexed pointers to all the items in
Pre_RS(T), in this paper the client is required to read the index at the beginning
of the next broadcast cycle, instead of the next nearest index, irrespective of
whether Tune_opt or (1,m) indexing is used. This makes the initial probe
step consistent with the basic behavior of method P, in which the
acquisition phase starts at the beginning of the next broadcast cycle for ease
of consistency maintenance [9], [10].

After obtaining the address of the next broadcast cycle in the initial probe
step, the client tunes in at the beginning of the next broadcast cycle and examines
the index information which is broadcast by the server. On the basis of the index
information, the client locally constructs a tuning sequence of pointers for all the
items in Pre_RS(T). This tuning sequence is constructed by sorting the pointers
of interest in the order they appear on the channel. Based on the generated
tuning sequence, the client performs both the index search and data retrieval steps
accordingly in order to access multiple data items.

With respect to consistency, since the content of the broadcast at each
cycle is guaranteed to be consistent, the execution of each read-only transaction
is clearly serializable if a client can fetch all data items within a single broadcast
cycle. Now that all data items for its transaction are already identified and a
sequence of multiplexed pointers to the data items is constructed, the client is
able to complete the acquisition phase within a single broadcast cycle.

More specifically, a client processes its transaction Ti as follows:

1. On receiving Begin(Ti) {
     get Pre_RS(Ti) by using the preprocessor;
     Acquire(Ti) = ∅;
     tune into the current bucket on the broadcast channel;
     read the offset to determine the address of the next broadcast cycle;
     go into doze mode and tune in at the beginning of the next broadcast cycle;
     from the index segment, construct a sequence of multiplexed pointers
       by sorting the sequences of pointers to the individual items in Pre_RS(Ti);
   }

2. While (Pre_RS(Ti) ≠ Acquire(Ti)) {
     for dj in Pre_RS(Ti) {
       according to the sequence of multiplexed pointers, tune in when dj is
         broadcast and download dj;
       put dj into local storage;
       Acquire(Ti) ⇐ dj;
     }
   }

3. Deliver data items to Ti according to the order in which Ti requires them,
   and then commit Ti.

Table 1. Symbols and Their Meaning

Symbol     Meaning
D          num. of items in the database
I          the size of the index in terms of buckets (i.e. items) in the index tree
n          num. of (primary-key plus pointer)s a bucket can hold
k          num. of levels in the index tree
m          num. of times the index is broadcast during a broadcast cycle
access_s   avg. access time for accessing a single item
access_t   avg. access time for accessing multiple items in a given transaction
tune_s     avg. tuning time for accessing a single item
tune_t     avg. tuning time for accessing multiple items in a given transaction
op         num. of items appearing in a transaction program



Theorem 1. Method P generates serializable execution of read-only transactions
if the server broadcasts only serializable data values in each broadcast cycle.

Proof. It follows directly from the fact that the data set read by each transaction
is a subset of a single broadcast.

□ 
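The acquisition and delivery phases of method P, and the single-cycle property used in the proof, can be sketched as follows. This is a simplified illustration, not the implementation evaluated in [9]: doze mode and the index are not modelled, one broadcast cycle is represented as a list of (item, value) pairs in broadcast order, and the function name is hypothetical.

```python
def run_method_p(pre_rs, cycle, read_order):
    """Acquire every predeclared item from ONE cycle, then deliver in program order.

    pre_rs     -- set of item ids the transaction may read (Pre_RS)
    cycle      -- list of (item, value) pairs of a single consistent broadcast cycle
    read_order -- item ids in the order the transaction actually reads them
    """
    acquired = {}                               # plays the role of Acquire(T)
    for item, value in cycle:                   # acquisition phase, broadcast order
        if item in pre_rs:
            acquired[item] = value
        if len(acquired) == len(pre_rs):
            break                               # everything predeclared is in hand
    missing = set(pre_rs) - set(acquired)
    if missing:
        raise LookupError(f"items not broadcast in this cycle: {sorted(missing)}")
    # Delivery phase: all values come from the same cycle, hence from one
    # consistent database state.
    return [acquired[item] for item in read_order]


cycle = [("d0", 5), ("d1", 7), ("d2", 1), ("d3", 9)]
# Program from Section 3: IF (d2 < 3) THEN read(d0) ELSE read(d1)
values = run_method_p({"d0", "d1", "d2"}, cycle, ["d2", "d0"])
```

Because every delivered value is taken from the same cycle, the transaction observes a single database state, which is the observation the proof relies on.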



4 Analysis 

In this section, we develop analytical models to examine the average access time
and tuning time of predeclaration-based transaction processing. We derive
the basic equations that describe the expected average access time and tuning
time, which are measured in the number of data items broadcast by the server. In
the following analysis, we preclude the possibility of client disconnections for the
sake of simplicity. Note that, in wireless data broadcast, the performance of a
single client read-only transaction for a given broadcast program is independent
of the presence of other clients' transactions. As a result, we analyze the
environment by considering only a single client. Furthermore, the performance
of method P is totally immune to the update rate of data items in the database,
since method P completes its acquisition phase within a single broadcast cycle
without a local caching technique. The symbols used throughout the analysis and
their meanings are summarized in Table 1.

4.1 Access_opt

This organization provides the best access time with a very large tuning time.

Access Time: For a database of D items (or buckets), access_s will, on
average, be half the time between successive broadcasts of the data items, i.e.
$\mathit{access}_s = \frac{D}{2}$. In method P, transaction processing is divided
into three phases: the preparation, acquisition, and delivery phases. If the time
required by a client for each of the three phases is expressed as PT, AT, and DT
respectively, the access time can be formulated as

$$\mathit{access}_t = PT + AT + DT \qquad (1)$$

PT will on average be half of one broadcast cycle and DT is trivial; thus
Expression (1) can be reduced to

$$\mathit{access}_t = \frac{D}{2} + AT \qquad (2)$$

AT involves retrieving all the items in the predeclared readset in the order
they appear on the broadcast channel. The retrieval time for the first item is
access_s itself. The retrieval time for the second item is half of the remaining
bcast size, and the retrieval time for the next item is in turn half of the
remaining bcast size, and so on. Thus, the expected AT for a transaction with op
predeclared items is

$$AT = \sum_{i=1}^{\mathit{op}} \left(\frac{1}{2}\right)^{i} D \qquad (3)$$

The expected average access time of method P is therefore computed as

$$\mathit{access}_t = \frac{D}{2} + \sum_{i=1}^{\mathit{op}} \left(\frac{1}{2}\right)^{i} D \qquad (4)$$
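Expression (4) is easy to evaluate numerically. The sketch below is a direct transcription of the formula; the function name is hypothetical, and rounding up to whole buckets is an assumption about how the tabulated values were obtained (for example, D = 243 and op = 3 give 334.125, which rounds up to the 335 listed in Table 2).

```python
import math


def access_time_access_opt(D, op):
    """Expected access time of method P under Access_opt, Expression (4)."""
    preparation = D / 2                                        # PT
    acquisition = sum(0.5 ** i * D for i in range(1, op + 1))  # AT, Expression (3)
    return preparation + acquisition


simple = access_time_access_opt(243, 3)        # 334.125 buckets
scenario_1 = access_time_access_opt(1000, 6)   # 1484.375 buckets
simple_rounded = math.ceil(simple)
scenario_1_rounded = math.ceil(scenario_1)
```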

Tuning Time : The average tuning time is equal to access time. This is because, 
the client has to be in active mode throughout the period of access. 

4.2 Tune_opt

This organization provides the best tuning time with a large access time with
respect to a single item. The server broadcasts the index at the beginning of
each bcast.

Access Time: The probe wait, i.e. the average duration for getting to the
next index information, is $\frac{D + I}{2}$, which corresponds to PT in method P.
With reasoning similar to that for AT in Access_opt, the bcast wait, i.e. the average
duration from the point at which the index information relevant to the required
transaction data items is encountered to the point at which the required items are
downloaded, is $\sum_{i=1}^{\mathit{op}} \left(\frac{1}{2}\right)^{i} (D + I)$, which
again corresponds to AT in method P. Since the access time is the sum of the probe
wait (i.e. PT) and the bcast wait (i.e. AT),

$$\mathit{access}_t = \frac{D + I}{2} + \sum_{i=1}^{\mathit{op}} \left(\frac{1}{2}\right)^{i} (D + I) \qquad (5)$$
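As a worked instance of Expression (5), consider the parameters of the simple scenario of Table 2 (D = 243, I = 121, op = 3). The final rounding up to a whole bucket is an assumption about how the tabulated value was obtained.

```latex
\begin{align*}
D + I &= 243 + 121 = 364, \\
\text{probe wait} &= \frac{364}{2} = 182, \\
\text{bcast wait} &= \left(\frac{1}{2} + \frac{1}{4} + \frac{1}{8}\right) \cdot 364 = 318.5, \\
\mathit{access}_t &= 182 + 318.5 = 500.5 .
\end{align*}
```

Rounded up to whole buckets, this gives the 501 reported for Tune_opt in Table 2.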

Tuning Time: The average tuning time for accessing a single item, tune_s, is k + 1,
where k is the number of levels in the multi-leveled index tree, and 1 accounts for
the final probe to download the item. When the index tree is fully balanced,
$k = \lceil \log_n(D) \rceil$ and $I = 1 + n + n^2 + \cdots + n^{k-1}$.

In a one-at-a-time access protocol, the average tuning time for accessing
multiple items in a given transaction, tune_t, is op(k + 1). In our method, however,
multiple data items are retrieved with the help of a sequence of multiplexed
pointers. Therefore, in method P, tune_t is op(k + 1) minus the number of shared index
bucket re-visits over all levels. For example, the root level in the
index tree needs to be visited only once for multiple items. Since the expected
number of shared index bucket re-visits at level i is computed as max(0, op -
(num. of index buckets at level i)), the expected total number of shared index
bucket re-visits over all levels is $\sum_{i=0}^{k-1} \max(0, \mathit{op} - n^{i})$.

With this reasoning, the tuning time of method P is

$$\mathit{tune}_t = \mathit{op}\,(k + 1) - \sum_{i=0}^{k-1} \max(0, \mathit{op} - n^{i}) \qquad (6)$$
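The quantities k, I, and Expression (6) can be checked with a few lines of code. The sketch assumes a fully balanced index tree with n^i buckets at level i, as in the text; the function names are hypothetical, and k is computed with integer arithmetic to avoid floating-point rounding in the logarithm.

```python
def index_levels(D, n):
    """Smallest k with n**k >= D, i.e. the ceiling of log_n(D)."""
    k, capacity = 0, 1
    while capacity < D:
        capacity *= n
        k += 1
    return k


def index_size(n, k):
    """I = 1 + n + n^2 + ... + n^(k-1) for a fully balanced index tree."""
    return sum(n ** i for i in range(k))


def tuning_time_method_p(op, k, n):
    """Expression (6): op(k+1) minus the shared index bucket re-visits."""
    revisits = sum(max(0, op - n ** i) for i in range(k))
    return op * (k + 1) - revisits


k_simple = index_levels(243, 3)                        # simple scenario
i_simple = index_size(3, k_simple)
tune_simple = tuning_time_method_p(3, k_simple, 3)
tune_scenario_1 = tuning_time_method_p(6, index_levels(1000, 10), 10)
```

For the simple scenario this yields k = 5, I = 121, and a tuning time of 16, and for Scenario 1 a tuning time of 19, in agreement with Table 2.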



4.3 (1,m) indexing

In this organization, the index information is broadcast m times during the
broadcast of the database.

Access Time: In general, the probe wait is $\frac{1}{2}\left(I + \frac{D}{m}\right)$
when the client tunes in at the broadcast of the next nearest index segment. In
method P, however, the client tunes in at the beginning of the next broadcast cycle,
and hence the probe wait is $\frac{1}{2}(mI + D)$. With reasoning similar to that
for Tune_opt, the bcast wait is
$\sum_{i=1}^{\mathit{op}} \left(\frac{1}{2}\right)^{i} (mI + D)$. Since the access
time is the sum of the probe wait and the bcast wait,

$$\mathit{access}_t = \frac{1}{2}(mI + D) + \sum_{i=1}^{\mathit{op}} \left(\frac{1}{2}\right)^{i} (mI + D) \qquad (7)$$

Tuning Time: The average tuning time is equal to that in Tune_opt. This is
because the client follows the same access protocol throughout the period of
access.
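A short numerical sketch of Expression (7) is given below. The function names are hypothetical, and rounding up to whole buckets is again an assumption about how the values in Table 2 were obtained.

```python
import math


def probe_wait_nearest_index(D, I, m):
    """General (1,m) probe wait when tuning in at the next nearest index segment."""
    return 0.5 * (I + D / m)


def access_time_1m_method_p(D, I, m, op):
    """Expression (7): method P waits for the beginning of the next broadcast cycle."""
    cycle = m * I + D                                   # length of one bcast
    probe_wait = 0.5 * cycle
    bcast_wait = sum(0.5 ** i * cycle for i in range(1, op + 1))
    return probe_wait + bcast_wait


simple = access_time_1m_method_p(243, 121, 2, 3)        # 666.875 buckets
scenario_1 = access_time_1m_method_p(1000, 111, 3, 6)   # 1978.671875 buckets
```

For m = 1 the expression coincides with Expression (5) for Tune_opt. The access time grows with m because method P always waits for the start of a cycle, so the shorter general probe wait of (1,m) indexing is not exploited.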



5 Analytical Results and Practical Implications 

In the following, we show some analytical results for the scenarios from [7]. Since
we have reported in [9] that method P achieves a much better access time than other
transaction processing techniques, such as invalidation-based or multiversion-based
techniques [12], in pure-push broadcast environments, we concentrate in this paper
on good power conservation, i.e. a low tuning time, while retaining a low access time.

Table 2. Access and Tuning Time Result

Symbol                    Simple   Scenario 1   Scenario 2   Scenario 3
D                            243         1000        10000       100000
I                            121          111         1111        11111
n                              3           10           10           10
k                              5            3            4            5
m                              2            3            3            3
op                             3            6            6            6
access_t in Access_opt       335         1485        14850       148500
tune_t in Access_opt         335         1485        14850       148500
access_t in Tune_opt         501         1650        16493       164931
tune_t in Tune_opt            16           19           25           31
access_t in (1,m)            667         1979        19792       197917
tune_t in (1,m)               16           19           25           31

5.1 Some Scenarios and Results 

Table 2 illustrates the access and tuning time required by the three indexing
techniques for various parameter settings, with the emphasis on Scenarios 1 to 3 for
n = 10. The second column, a simple scenario, is included to exemplify different
variants of index distribution in a database consisting of 243 data buckets with
n = 3. The bottom six rows give the access and tuning time in Access_opt,
Tune_opt, and (1,m) indexing respectively. As the table illustrates, the tuning time
is the same in Tune_opt and (1,m) indexing. The tuning time in Access_opt is
far higher than in the other two; both Tune_opt and (1,m)
indexing always perform better than Access_opt in terms of tuning time. The most
interesting observation is that method P with selective tuning ability shows the
most desirable performance behavior in terms of access time and tuning time
with Tune_opt indexing. This is further explained in the following practical
implications.

5.2 Practical Implications 

Consider a broadcasting system similar to the quotrex system [7], where
stock market information of size 16 x 10^4 bytes is being broadcast. The broadcast
channel has a bandwidth of 10 Kbps. Let the bucket length be 128 bytes. Thus,
there are 1250 buckets of data. Let n, the number of (primary-key plus pointer)s
that can fit in a bucket, be 25. The index size is 53 buckets. It takes around 0.1
seconds to broadcast a single bucket and 125 seconds to broadcast the whole
database (with no index). Let the clients be equipped with the Hobbit Chip
(AT&T). The power consumption of the chip in doze mode is 50 µW and the
consumption in active mode is 250 mW.

With Access_opt, the access time of a transaction with 6 predeclared items
is 1856 buckets, i.e. 185.6 seconds. The tuning time is also 1856 buckets, i.e. the
energy consumption is 185.6 sec x 250 mW = 46.400 Joules.

With Tune_opt, the access time of a transaction with 6 predeclared items
is 1935 buckets, i.e. 193.5 seconds. The tuning time is 15 buckets, i.e. the
energy consumption is 0.1 sec x (15 x 250 + 1920 x 50 x 10^-3) mW = 0.385 Joules.

With (1,m) indexing, the optimum m can be computed to be 5 according to the
equation in [7]. The access time of a transaction with 6 predeclared items is
2249 buckets, i.e. 224.9 seconds. The tuning time is 15 buckets, i.e. the
energy consumption is 0.1 sec x (15 x 250 + 2234 x 50 x 10^-3) mW = 0.386 Joules.

To sum up, in the Tune_opt case, the energy consumed per transaction is about
120 times smaller than that of Access_opt. This is achieved by compromising on
the access time, which increases by just 4.25%, which is only marginal. Compared
with (1,m) indexing, the energy consumption is almost the same, but the
access time improves to 86% of the access time in (1,m) indexing.
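The energy figures above follow from a simple model in which the client is active during the tuning time and dozes for the rest of the access time. The sketch below restates that arithmetic using the parameters given in the text (0.1 s per bucket, 250 mW active, 50 µW doze); the function and constant names are hypothetical.

```python
BUCKET_TIME_S = 0.1    # time to broadcast one bucket
ACTIVE_MW = 250.0      # power consumption in active mode
DOZE_MW = 50e-3        # 50 microwatts, expressed in milliwatts


def energy_joules(access_buckets, tuning_buckets):
    """Energy per transaction: active while tuning, dozing otherwise."""
    doze_buckets = access_buckets - tuning_buckets
    millijoules = BUCKET_TIME_S * (tuning_buckets * ACTIVE_MW + doze_buckets * DOZE_MW)
    return millijoules / 1000.0


e_access_opt = energy_joules(1856, 1856)   # always active
e_tune_opt = energy_joules(1935, 15)
e_1m = energy_joules(2249, 15)
saving_factor = e_access_opt / e_tune_opt
access_penalty = 1935 / 1856 - 1           # relative increase in access time
```

Evaluating these expressions gives roughly 46.4 J, 0.385 J, and 0.386 J, a saving factor of about 120, and an access time increase of about 4.3%, consistent with the figures quoted above.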

These practical implications are interesting in that, if we want a low
access time and also low power consumption, then we can use Tune_opt indexing
in the context of predeclaration-based transaction processing.

6 Conclusion 

Energy efficiency is crucial to mobile computing. We have explored power-saving
predeclaration-based transaction processing in mobile broadcast environments.
In particular, we investigated an access protocol for multiple data items in the
Tune_opt and (1,m) indexing techniques, and integrated it with predeclaration-based
transaction processing. The preliminary analytical results demonstrated
the advantages of this approach. Interestingly, the optimum solution for tuning
time is well-suited to predeclaration-based transaction processing. In
the context of predeclaration-based transaction processing augmented with the
proposed access protocol, Tune_opt shows better performance behavior than the
no-index data organization and (1,m) indexing. This is rather contrary to the
traditional belief that the optimum solution for tuning time is not practical
because it has an unacceptably large access time for a single item access.

As future work, we are investigating the extension of the proposed access
protocol to client caching techniques. A client may find data of interest in its
local cache and thus needs to access the broadcast channel fewer
times. Effective use of this feature can provide further performance improvement
in terms of power consumption.